
Remove duplicate entries #66

Merged
16 commits merged into scrapd:master on Mar 18, 2019

Conversation

@mrengler (Contributor) commented Mar 16, 2019

Types of changes

  • New feature (non-breaking change which adds functionality)

Description

This change introduces a remove_duplicate_entries() function, which takes the final list of fatality entries and returns it with any entries that have duplicate case numbers removed. This relies on the case number being a unique identifier. When two entries share a case number, they are merged into one; for any conflicting field, the value from the second (more recent) entry takes precedence, as sketched below.
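
Roughly, the merge looks like this (an illustrative sketch rather than the exact code in this PR; the "Case" key stands in for the Fields.CASE constant):

def remove_duplicate_entries(entries, case_field="Case"):
    """Merge entries sharing a case number; later duplicates take precedence."""
    merged = {}
    for entry in entries:
        # setdefault() creates an empty dict the first time a case number is
        # seen; update() lets a later duplicate overwrite conflicting fields,
        # so the more recent entry wins.
        merged.setdefault(entry.get(case_field), {}).update(entry)
    return list(merged.values())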

Some of the cases appear multiple times in the current output, leading to an incorrect overall count, so this change attempts to fix that issue.

I tested it by adding a test to tests/core/test_apd.py (test_remove_duplicate_entries(), sketched below) and by examining the output and the deleted entries over several time periods. I am still working on the tests for this change and would like to verify that the total fatality count for 2018 is correct.
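
In sketch form, the test checks behavior like this (the sample entries here are invented for illustration):

def test_remove_duplicate_entries():
    entries = [
        {"Case": "18-1", "Location": "old value"},
        {"Case": "18-1", "Location": "new value"},  # duplicate case number
        {"Case": "18-2"},
    ]
    deduped = remove_duplicate_entries(entries)
    assert len(deduped) == 2
    # The field from the later (more recent) duplicate takes precedence.
    by_case = {e["Case"]: e for e in deduped}
    assert by_case["18-1"]["Location"] == "new value"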

Checklist:

  • I have updated the documentation accordingly
  • I have written unit tests

Fixes: #48

@mrengler (Contributor, Author) commented Mar 16, 2019

There are supposed to be 73 fatal crashes in 2018, and without duplicates we return 72. The de-duplication changes made here appear to be correct; I think we are missing the 30th fatal crash of the year, for which I cannot locate a valid link. The returned entries jump from crash 29 to crash 31.

On another note, there are two changes I would still like to make to this PR: 1) de-duplication currently ruins the order of cases, so the list should be reordered before it is returned, and 2) the tests should use the test scenarios instead of the standalone test function.

@rgreinho (Member) commented Mar 17, 2019

Nicely done, but it can be even simpler!

Here is what I have done:

diff --git a/scrapd/core/apd.py b/scrapd/core/apd.py
index 8848789..2d0973b 100644
--- a/scrapd/core/apd.py
+++ b/scrapd/core/apd.py
@@ -447,7 +447,7 @@ async def fetch_and_parse(session, url):

 async def async_retrieve(pages=-1, from_=None, to=None):
     """Retrieve fatality data."""
-    res = []
+    res = {}
     page = 1
     has_entries = False
     no_date_within_range_count = 0
@@ -500,7 +500,8 @@ async def async_retrieve(pages=-1, from_=None, to=None):
                     break

                 # Otherwise store the results.
-                res += entries_in_time_range
+                for entry in entries_in_time_range:
+                    res.setdefault(entry.get(Fields.CASE), {}).update(entry)

             # Stop if there is no further pages.
             if not has_next(news_page):
@@ -511,6 +512,4 @@ async def async_retrieve(pages=-1, from_=None, to=None):

             page += 1

-    res = remove_duplicate_entries(res)
-
-    return res, page
+    return list(res.values()), page

Only four lines changed, the remove_duplicate_entries() function is no longer needed, all your tests pass, and it reads very well 📖

Let me know what you think.

@rgreinho (Member) commented Mar 17, 2019

Thinking about it a bit more, the update() would update the newest entry with the oldest, which is probably not what we want...

Instead, this would keep the first entry added:

for entry in entries_in_time_range:
    if entry.get(Fields.CASE) not in res:
        res[entry.get(Fields.CASE)] = entry
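
To make the difference concrete, here is a standalone illustration with invented entries (plain string keys stand in for Fields.CASE):

entries_in_time_range = [
    {"Case": "19-1", "Age": "38"},                    # newest entry
    {"Case": "19-1", "Age": "39", "Gender": "male"},  # older duplicate
]

# setdefault().update(): the older duplicate overwrites the newer entry's fields.
res = {}
for entry in entries_in_time_range:
    res.setdefault(entry["Case"], {}).update(entry)
assert res["19-1"]["Age"] == "39"

# Keep-first: the newest entry wins and later duplicates are ignored.
res = {}
for entry in entries_in_time_range:
    if entry["Case"] not in res:
        res[entry["Case"]] = entry
assert res["19-1"]["Age"] == "38"
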
@mrengler (Contributor, Author) commented Mar 17, 2019

Ah, I see! If we don't break out removing duplicates as a separate function, should we just remove the test_remove_duplicate_entries test, since ensure_results will compare the entry_count over this time period anyway?

@rgreinho (Member) commented Mar 18, 2019

Ah yes, the test will be unused, so you can remove it as well.

@mrengler (Contributor, Author) commented Mar 18, 2019

So, the flake8 linter really doesn't like async_retrieve right now lol. With the if-statement it's just too complex, so instead I add the entries in reverse order, letting newer entries overwrite older ones. But this is perhaps not as clear, so we could also either ignore flake8 or break more of async_retrieve out into a separate function. A sketch of the reversed approach follows.
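
Something like this, assuming entries_in_time_range arrives newest first:

# Slicing makes a reversed copy, so the oldest entries are processed first
# and update() naturally lets newer fields overwrite older ones, without
# needing an extra if-branch.
for entry in entries_in_time_range[::-1]:
    res.setdefault(entry.get(Fields.CASE), {}).update(entry)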

@rgreinho (Member) commented Mar 18, 2019

Hehe, very resourceful! I like that! 👍

However, you're making a full copy of the list in reversed order just to please flake8. This does not seem very efficient.

What I would propose is:

  1. adjust the flake8 complexity threshold to whatever you need (https://github.com/scrapd/scrapd/blob/master/setup.cfg#L59); a sketch follows below
  2. open an issue to remind us to simplify async_retrieve() or split it into sub-functions
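
For item 1, it would be a one-line tweak along these lines (the threshold value here is a placeholder; use whatever the function currently needs):

[flake8]
max-complexity = 11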

@mrengler changed the title from "Remove duplicate entries (WIP)" to "Remove duplicate entries" on Mar 18, 2019

@rgreinho (Member) left a comment

Nice job! 👍

@mergify bot merged commit c48d72f into scrapd:master on Mar 18, 2019

7 checks passed:

  • Mergify Summary (1 rule matches)
  • ci/circleci: docs (Your tests passed on CircleCI!)
  • ci/circleci: format (Your tests passed on CircleCI!)
  • ci/circleci: lint (Your tests passed on CircleCI!)
  • ci/circleci: prepare (Your tests passed on CircleCI!)
  • ci/circleci: test (Your tests passed on CircleCI!)
  • coverage/coveralls (Coverage increased (+0.4%) to 97.491%)

mrengler added a commit to mrengler/scrapd that referenced this pull request on Mar 20, 2019

Remove duplicate entries (scrapd#66)
* Added remove_duplicate_entries() function, called before returning list of entries in time frame.

* Added test to remove duplicate entries.

* Formatted remove_duplicate_entries test.

* Sort to preserve original order in de-duplication.

* Added changes to CHANGELOG.

* Changed structure of tests to align with pytest parameterize.

* Removed forgotten debugging printouts.

* Fix total number of 2018 entries, simplify remove_duplicate_entries function,
test length only since order of dicts is not guaranteed.

* Fixed formatting / linting errors.

* Removed remove_duplicate_entries to contain within async_retrieve, removed extra test.

* Removed unused remove_duplicate_entries code.

* Made flake8 allow extra layer of complexity, added if-statement logic to res update.