This repository has been archived by the owner on Feb 2, 2022. It is now read-only.

Remove duplicate entries #66

Merged 16 commits on Mar 18, 2019

Conversation

mrengler
Contributor

Types of changes

  • New feature (non-breaking change which adds functionality)

Description

This change introduces a remove_duplicate_entries() function, which takes the final list of fatality entries as a parameter and returns the list with any entries that have duplicate case numbers removed. This relies on the case number being a unique identifier. When two entries share the same case number, they are merged into one; for any conflicting field, the value from the second (more recent) entry takes precedence.

Some of the cases appear multiple times in the current output, leading to an incorrect overall count, so this change attempts to fix that issue.
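
In essence, the function does something like this (simplified sketch; the real implementation in this PR may differ in its details):

def remove_duplicate_entries(entries):
    """Merge entries that share a case number; later entries win conflicting fields."""
    merged = {}
    for entry in entries:
        # Group by case number and let fields from later (more recent) entries
        # overwrite the fields of earlier ones.
        merged.setdefault(entry.get(Fields.CASE), {}).update(entry)
    return list(merged.values())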

I tested it by adding a test to tests/core/test_apd (test_remove_duplicate_entries()) and by examining the output and deleted entries over several time periods. I am still working on the tests for this change and would like to ensure correctness for total fatalities in 2018.

Checklist:

  • I have updated the documentation accordingly
  • I have written unit tests

Fixes: #48

@mrengler
Contributor Author

There are supposed to be 73 fatal crashes in 2018, and without duplicates we return 72. The de-duplication changes made here appear to be correct; I think we are missing the 30th fatal crash of the year, for which I cannot locate a valid link. The entries returned jump straight from the 29th to the 31st.

On another note, the changes I would still like to make to this PR are 1) de-duplication currently scrambles the order of cases, so the list should be reordered before it is returned, and 2) the tests should use the test scenarios instead of the test function.
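
For 1), something along these lines might work (just a sketch; first_seen and deduplicated are hypothetical names, where deduplicated is the output of remove_duplicate_entries()):

# Remember where each case number first appeared in the original list,
# then sort the de-duplicated list by that position to restore the original order.
first_seen = {}
for position, entry in enumerate(entries):
    first_seen.setdefault(entry.get(Fields.CASE), position)
deduplicated.sort(key=lambda e: first_seen[e.get(Fields.CASE)])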

@rgreinho
Member

Nicely done, but it can be even simpler!

Here is what I have done:

diff --git a/scrapd/core/apd.py b/scrapd/core/apd.py
index 8848789..2d0973b 100644
--- a/scrapd/core/apd.py
+++ b/scrapd/core/apd.py
@@ -447,7 +447,7 @@ async def fetch_and_parse(session, url):

 async def async_retrieve(pages=-1, from_=None, to=None):
     """Retrieve fatality data."""
-    res = []
+    res = {}
     page = 1
     has_entries = False
     no_date_within_range_count = 0
@@ -500,7 +500,8 @@ async def async_retrieve(pages=-1, from_=None, to=None):
                     break

                 # Otherwise store the results.
-                res += entries_in_time_range
+                for entry in entries_in_time_range:
+                    res.setdefault(entry.get(Fields.CASE), {}).update(entry)

             # Stop if there is no further pages.
             if not has_next(news_page):
@@ -511,6 +512,4 @@ async def async_retrieve(pages=-1, from_=None, to=None):

             page += 1

-    res = remove_duplicate_entries(res)
-
-    return res, page
+    return list(res.values()), page

Only 4 lines changed, the remove_duplicate_entries() function is no longer needed, all your tests are passing, and it reads very well 📖

Let me know what you think.

@rgreinho
Member

Thinking about it a bit more, the update() would update the newest entry with the oldest, which is probably not what we want...

Instead, this would keep the first entry added:

for entry in entries_in_time_range:
    if entry.get(Fields.CASE) not in res:
        res[entry.get(Fields.CASE)] = entry

@mrengler
Contributor Author

Ah, I see! If we don't break out removing duplicates as a separate function, should we just remove the test_remove_duplicate_entries test, since ensure_results will compare the entry_count over this time period anyway?

@rgreinho
Member

Ah yes, the test will be unused, so you can remove it as well.

@mrengler
Contributor Author

So, the flake8 linter really doesn't like async_retrieve right now lol. With the if-statement it's just too complex, so instead I add the entries in reverse order, letting newer entries overwrite older ones. But this is perhaps not as clear, so we could also either ignore flake8 or break out more of async_retrieve into a separate function.
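
Roughly, the loop becomes something like this (sketch only, the exact code may differ; the slice makes a reversed copy so the oldest entries are processed first):

for entry in entries_in_time_range[::-1]:
    # Newer entries are applied last, so their fields overwrite older ones.
    res.setdefault(entry.get(Fields.CASE), {}).update(entry)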

@rgreinho
Member

Hehe, very resourceful! I like that! 👍

However, you're making a full copy of the list in reversed order just to please flake8. This does not seem very efficient.

What I would propose is:

  1. to adjust the flake8 complexity to whatever you need (https://github.com/scrapd/scrapd/blob/master/setup.cfg#L59); see the sketch after this list
  2. to open an issue to remind us to simplify async_retrieve() or split it into sub-functions
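
For 1), that would just be a tweak to the flake8 section of setup.cfg, something like this (the exact limit is up to you):

[flake8]
max-complexity = 12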

@mrengler changed the title from "Remove duplicate entries (WIP)" to "Remove duplicate entries" on Mar 18, 2019
@rgreinho (Member) left a comment


Nice job! 👍

@mergify (bot) merged commit c48d72f into scrapd:master on Mar 18, 2019
mrengler added a commit to mrengler/scrapd that referenced this pull request Mar 20, 2019
* Added remove_duplicate_entries() function, called before returning list of entries in time frame.

* Added test to remove duplicate entries.

* Formatted remove_duplicate_entries test.

* Sort to preserve original order in de-duplication.

* Added changes to CHANGELOG.

* Changed structure of tests to align with pytest parameterize.

* Removed forgotten debugging printouts.

* Fix total number of 2018 entries, simplify remove_duplicate_entries function,
test length only since order of dicts is not guaranteed.

* Fixed formatting / linting errors.

* Removed remove_duplicate_entries to contain within async_retrieve, removed extra test.

* Removed unused remove_duplicate_entries code.

* Made flake8 allow extra layer of complexity, added if-statement logic to res update.