Parse Age without Date of Birth #125

Merged
merged 29 commits into scrapd:master on May 23, 2019

Conversation

2 participants
@mscarey (Contributor) commented May 17, 2019

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to change; could affect parsing outcomes)
  • Code cleanup / Refactoring

Description

I added an alternate parsing function for Deceased fields with no DOB available. I also changed the existing parsing functions to convert date strings to datetimes earlier, and changed the lines that compare datetimes to use < and <= instead of the custom functions is_posterior and is_in_range. (The docstring for is_posterior seems to be wrong about what the function does, but I tried to match its actual behavior.)
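A minimal sketch of the comparison change described above (the helper name and the MM/DD/YYYY format are assumptions for illustration; the real scrapd functions may differ):

```python
from datetime import datetime

def parse_date(date_string):
    """Convert a date string to a datetime early on (format assumed here)."""
    return datetime.strptime(date_string.strip(), "%m/%d/%Y")

dob = parse_date("08/30/1966")
crash_date = parse_date("05/17/2019")

# With real datetime objects, the built-in comparison operators
# replace custom helpers like is_posterior and is_in_range.
assert dob < crash_date
assert dob <= crash_date
```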

This is to fix issue #92, Deceased fields that can't be parsed because they contain no DOB.

I added one new test case with the Deceased field that was failing. That one is passing. There's still one record that's failing and showing age -1, but that seems to be a different issue, involving a DOB format with spaces in it. I also changed some tests to expect datetimes instead of strings.

By the way, I accidentally added these commits on top of old commits that aren't relevant to this pull request. I decided to submit the pull request anyway, rather than risk making things worse by trying to rewrite the git history. The overall diff for the pull request looks correct to me.

Checklist:

  • [ ] I have updated the documentation accordingly
  • [x] I have written unit tests

#92

mscarey added some commits Apr 17, 2019

use first item of deceased_field for First Name
Changed the part of parse_deceased_field that removed some items and then used the last remaining item as the First Name. I wasn't able to verify in the CLI that the issue (#74, recording a nickname as the first name) is fixed.
move split method within deceased field parser
Allows parse_deceased_field to be tested by passing in strings rather than pre-split lists.
split deceased field on some slashes
To handle gender and ethnicity fields in a format like "W/F"
change parsing tests to expect datetimes
This makes the page content and Twitter parsing function tests expect datetimes.
The goal is to convert date strings to datetimes earlier, so we can more easily compare them chronologically.
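The "split deceased field on some slashes" commit above can be sketched as follows (the function name and field layout are assumptions for illustration; the real scrapd implementation may differ):

```python
import re

def split_deceased_field(deceased_field):
    """Split a Deceased field on commas, then break compact tokens
    like 'W/F' into separate ethnicity and gender entries."""
    tokens = []
    for part in deceased_field.split(','):
        part = part.strip()
        # A token with exactly one slash between word characters,
        # e.g. "W/F", is split further; date-like tokens such as
        # "01/01/1980" contain two slashes and are left intact.
        if re.fullmatch(r'\w+/\w+', part):
            tokens.extend(part.split('/'))
        else:
            tokens.append(part)
    return tokens

# split_deceased_field("Jane Doe, W/F, 01/01/1980")
# -> ['Jane Doe', 'W', 'F', '01/01/1980']
```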
@rgreinho (Member) left a comment

Hey Matt!

Good job with the PR! I am just doing a small pre-review until all the checks pass (note to self: I should have a bot send reminders for me).

  • Good catch with the incorrect docstrings for the date functions. However, I'd like to keep those functions, as they read more easily than the bare operators. I'll rename them following the moment.js naming patterns, fix the docstrings, and add unit tests.
  • I would like to keep the sanitize function and the parse functions separated.
  • Have you tried combining your parse_age_deceased_field with the dob_search function? That might be an easy way to piggyback on the existing parsing functions, since we should already be able to parse what is on the left side of the DOB.

The biggest remark about this PR, however, is that all the values should be strings. If you use a datetime object, we cannot serialize the entity to JSON without adding custom logic. So this part should be reverted, unless you find a workaround.
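The serialization concern can be illustrated like this (the `date_serialize` hook shown here is one possible workaround, sketched from the discussion below, not the project's actual code):

```python
import json
from datetime import date

record = {"Date": date(2019, 5, 17), "First Name": "Jane"}

# json.dumps raises TypeError on date/datetime objects by default...
try:
    json.dumps(record)
except TypeError:
    pass

# ...unless a custom serializer is supplied via the `default` hook.
def date_serialize(obj):
    if isinstance(obj, date):
        return obj.strftime("%m/%d/%Y")
    raise TypeError(f"{obj!r} is not JSON serializable")

print(json.dumps(record, sort_keys=True, default=date_serialize))
# {"Date": "05/17/2019", "First Name": "Jane"}
```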

@rgreinho (Member) commented May 19, 2019

The date functions will be fixed/tested/updated in PR #127.

mscarey added some commits May 20, 2019

change datetime objects to date objects
None of the datetime objects was using the time fields.
remove ValueError for failed age parsing.
It raises an AttributeError anyway
slightly delay conversion of datetime to date
This way failure raises the exception expected by the tests
@mscarey (Contributor, Author) commented May 20, 2019

Hmm, I didn't know there were no tests for those formatters. I added some, but not for the Google Sheets formatter because I don't really understand its requirements. So I don't know whether this PR breaks your Google Sheets workflow, @rgreinho. Could you maybe tell me whether it does, or write a test describing the expected behavior?

About the custom operator functions I want to delete: I think a big advantage of using the built-in comparison operators is that you don't have to test or document them.

Right now, the only failing checks are failing because the command yapf -r -d -e '*.tox/*' -e '*venv/*' -e '*.eggs/*' . doesn't work on CircleCI. It doesn't work on my machine either, and I don't know why; or maybe I just couldn't read the linter output.

I do agree that dob_search could be neater, but I think the next step in updating dob_search should be to get it working with issue #128. That's going to require a pretty big change, so any cleanup should happen in the context of solving that problem. After this PR is done, I could look at #128 next if you want.

@rgreinho (Member) left a comment

Great job with the tests for the formatters! I did not write any because they were very simple and just using the stdlib, but I am definitely a big fan of adding tests for them, especially as we're adding custom logic to them.

Delaying the conversion of the values to strings is a good idea: it allows us to manipulate Python objects for all the operations without performing various conversions and reconversions of their values. I just want to ensure we generate the same output for all the formatters (CSV, print), and you did a good job reformatting the date for JSON.

Regarding the custom operators, I would really like to keep them. I added the missing tests and fixed the documentation, so I see no reason to delete them. Besides, I strongly think they read better than the bare operators.

I don't think fixing #128 will be such a big deal because we already have half of the logic. dateparser is able to search dates in a text, we should leverage this very powerful feature more than we do:

>>> # https://austintexas.gov/news/traffic-fatality-29-3
>>> from dateparser.search import search_dates
>>> search_dates('Patrick Leonard Ervin, Black male, D.O.B. August 30, 1966')
[('August 30, 1966', datetime.datetime(1966, 8, 30, 0, 0))]
>>>
@@ -11,7 +11,8 @@
 from tenacity import wait_exponential

 from scrapd.core.constant import Fields
-from scrapd.core import date_utils
+from scrapd.core.date_utils import check_dob, clean_date_string, compute_age
+from scrapd.core.date_utils import parse_date, from_date, to_date

@rgreinho (Member) commented May 22, 2019
Don't organize imports like that. It is clearer for the reader to know from which module the functions come from.

Some more explanations: https://docs.openstack.org/charm-guide/latest/coding-guidelines.html#import-style

    :return: A dictionary containing the details information about the fatality with sanitized entries.
    :rtype: dict
    """

    # All values must be strings.
    for k, v in d.items():
        if isinstance(v, list):
            d[k] = ' '.join(v)

@rgreinho (Member) commented May 22, 2019

This part should go to the sanitize_fatality_entity() function. The goal of the sanitizing function was to ensure that the final values had the right format, or that unnecessary or invalid values (like None or empty ones) were discarded.

@mscarey (Author, Contributor) commented May 23, 2019

Currently that part needs to happen before the deceased field is parsed, while sanitize_fatality_entity() has to happen afterwards.

            Fields.GENDER: "male",
            Fields.AGE: 19,
        },
    )))
def test_parse_deceased_field_00(deceased, expected):

@rgreinho (Member) commented May 22, 2019

The formatting of your test cases makes the diff harder to read.

I am not sure why the formatter organized your code like that, but I think it is because your last test case entry does not end with a trailing comma. Try replacing the ))) on line 306 with ),)) and run the formatter again.

f = CSVFormatter(output=stdout)
f.printer(RESULTS)
out, _ = capsys.readouterr()
assert "2005-12-05" in out

@rgreinho (Member) commented May 22, 2019

This test should fail, as the date does not have the right format. We expect dates to be MM/DD/YYYY.
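To illustrate the expected format (assuming strftime-based rendering; the formatter's internals are not shown here), the test date would need to read "12/05/2005" rather than the ISO form "2005-12-05":

```python
from datetime import date

# The CSV formatter is expected to emit MM/DD/YYYY,
# so a test asserting the ISO form "2005-12-05" should fail.
formatted = date(2005, 12, 5).strftime("%m/%d/%Y")
# "12/05/2005"
```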

@@ -415,7 +455,7 @@ def parse_deceased_field_common(split_deceased_field, fleg):

     # Extract and clean up DOB.
     raw_dob = split_deceased_field[-1].strip()
-    d[Fields.DOB] = date_utils.clean_date_string(raw_dob, True)
+    d[Fields.DOB] = clean_date_string(raw_dob, True)

@rgreinho (Member) commented May 22, 2019

Is this still useful? As you use Python objects for manipulating the dates, I believe you could use the parse_date() function directly.

You may even be able to get rid of the clean_date_string() function now.

@mscarey (Author, Contributor) commented May 23, 2019

I made the change, but it did result in separate DOB parsing commands for the "page" and "twitter" processes.

     def printer(self, results, **kwargs):  # noqa: D102
-        print(json.dumps(results, sort_keys=True, indent=2), file=self.output)
+        json_string = self.to_json_string(results, **kwargs)
+        print(json_string, file=self.output)

@rgreinho (Member) commented May 22, 2019

You don't need to create an extra function here. to_json_string() is not really necessary, you can simply put:

print(json.dumps(results, sort_keys=True, indent=2, default=self.date_serialize), file=self.output)

@mscarey (Author, Contributor) commented May 23, 2019

I wanted that line to be in a method with a return value for easier testing. I made more changes for the CSV formatter.

mscarey added some commits May 23, 2019

@rgreinho (Member) commented May 23, 2019

All the changes look good to me! Great job @mscarey! 👍

Once you update your branch, the code should be merged automatically.

@mergify mergify bot merged commit 5a333e0 into scrapd:master May 23, 2019

8 checks passed

  • Summary: 1 rule matches
  • ci/circleci: docs - Your tests passed on CircleCI!
  • ci/circleci: format - Your tests passed on CircleCI!
  • ci/circleci: lint - Your tests passed on CircleCI!
  • ci/circleci: prepare - Your tests passed on CircleCI!
  • ci/circleci: test-integrations - Your tests passed on CircleCI!
  • ci/circleci: test-units - Your tests passed on CircleCI!
  • coverage/coveralls - Coverage remained the same at 100.0%