
Parse Age without Date of Birth #125

Merged
merged 29 commits into from May 23, 2019

Conversation

@mscarey
Contributor

@mscarey mscarey commented May 17, 2019

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to change; could change parsing outcomes)
  • Code cleanup / Refactoring

Description

I added an alternate parsing function for Deceased fields with no DOB available. I also changed the existing parsing functions to convert date strings to datetimes earlier, and changed the lines that compare datetimes to use < and <= instead of the custom functions is_posterior and is_in_range. (The docstring for is_posterior seems to be wrong about what the function does, but I tried to match its actual behavior.)
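The switch away from the custom helpers can be illustrated with a small sketch (the datetime values here are made up; is_posterior and is_in_range are the project's helpers being replaced):

```python
import datetime

dob = datetime.datetime(1966, 8, 30)
deceased = datetime.datetime(2019, 5, 17)

# Instead of custom helpers such as is_posterior(a, b) or
# is_in_range(d, start, end), plain comparison operators work
# once both sides are datetime objects:
print(dob < deceased)                                    # True
print(datetime.datetime(1900, 1, 1) <= dob <= deceased)  # True
```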

This is to fix issue #92, Deceased fields that can't be parsed because they contain no DOB.

I added one new test case with the Deceased field that was failing; it now passes. One record still fails, showing an age of -1, but that seems to be a different issue involving a DOB format with spaces in it. I also changed some tests to expect datetimes instead of strings.

By the way, I accidentally added these commits on top of old commits that aren't relevant to this pull request, but I decided to submit it anyway rather than risk making things worse by trying to rewrite the git history. The overall diff for the pull request looks correct to me.

Checklist:

  • [ ] I have updated the documentation accordingly
  • [x] I have written unit tests


mscarey added 12 commits Apr 17, 2019
Changed the part of parse_deceased_field that removed some items and then used the last remaining item as the First Name. I wasn't able to verify that the issue (scrapd#74, recording a nickname as the first name) is fixed in the cli.
Allows the parse_deceased_field to be tested by passing in strings, not split lists
To handle gender and ethnicity fields in a format like "W/F"
This makes the page content and Twitter parsing function tests expect datetimes.
The goal is to convert date strings to datetimes earlier, so we can more easily compare them chronologically.
Member

@rgreinho rgreinho left a comment

Hey Matt!

Good job with the PR. I am just doing a small pre-review until all the checks pass (note for self, I should have a bot sending reminders for me).

  • Good catch with the incorrect docstring for the date functions. However, I'd like to keep them, as they read more easily than the operators. I'll rename them following the moment.js library patterns, fix the docstrings, and add unit tests.
  • I would like to keep the sanitize function and the parse functions separated.
  • Have you tried combining your parse_age_deceased_field with the dob_search function? I think that might be an easy way to piggyback on the existing parsing functions, as we should already be able to parse what is on the left side of it.

The biggest remark on this PR, however, is that all the values should be strings. If we use datetime objects, we cannot serialize the entity to JSON without adding custom logic. So this part should be reverted, unless you find a workaround.
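One possible workaround is the `default=` hook on `json.dumps`; a minimal sketch, assuming the formatter calls `json.dumps` directly (the `serialize_dates` helper and the MM/DD/YYYY layout are illustrative assumptions, not the project's actual API):

```python
import datetime
import json

def serialize_dates(obj):
    # Hypothetical default= hook: render date/datetime values as
    # MM/DD/YYYY strings so json.dumps can handle them.
    if isinstance(obj, (datetime.date, datetime.datetime)):
        return obj.strftime("%m/%d/%Y")
    raise TypeError(f"{type(obj)!r} is not JSON serializable")

entity = {"Name": "John Doe", "DOB": datetime.datetime(1966, 8, 30)}
print(json.dumps(entity, sort_keys=True, default=serialize_dates))
# {"DOB": "08/30/1966", "Name": "John Doe"}
```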

@rgreinho
Member

@rgreinho rgreinho commented May 19, 2019

The date functions will be fixed/tested/updated in PR #127.

@mscarey
Contributor Author

@mscarey mscarey commented May 20, 2019

Hmm, I didn't know there were no tests for those formatters. I added some, but not for the Google Sheets formatter because I don't really understand its requirements. So I don't know whether this PR breaks your Google Sheets workflow, @rgreinho. Could you maybe tell me whether it does, or write a test describing the expected behavior?

About the custom operator functions I want to delete: I think a big advantage of using the builtin comparison operators is that you don't have to test or document them.

Right now the only failing checks are due to the command yapf -r -d -e '*.tox/*' -e '*venv/*' -e '*.eggs/*' . not working on CircleCI. It doesn't work on my machine either, and I don't know why. Or maybe I just couldn't read the linter output.

I do agree that dob_search could be neater, but I think the next step in updating dob_search should be to get it working with issue #128. That's going to require a pretty big change, so any cleanup should happen in the context of solving that problem. After this PR is done, I could look at #128 next if you want.

Member

@rgreinho rgreinho left a comment

Great job with the tests for the formatters! I did not write any because they were very simple and just using the stdlib, but I am definitely a big fan of adding tests for them, especially as we're adding custom logic to them.

Delaying the conversion of the values to string is a good idea, and allows us to manipulate Python objects for all the operations without having to perform various conversions and reconversions of their values. I just want to ensure we generate the same output for all the formatters (CSV, print) — and you did a good job reformatting the date for JSON.

Regarding the custom operators, I would really like to keep them. I added the missing tests and fixed/updated the documentation, so I think there is no reason to delete them. Besides, I strongly think that they read better than the operators.

I don't think fixing #128 will be such a big deal because we already have half of the logic. dateparser is able to search dates in a text, we should leverage this very powerful feature more than we do:

>>> # https://austintexas.gov/news/traffic-fatality-29-3
>>> from dateparser.search import search_dates
>>> search_dates('Patrick Leonard Ervin, Black male, D.O.B. August 30, 1966')
[('August 30, 1966', datetime.datetime(1966, 8, 30, 0, 0))]
>>>

@@ -11,7 +11,8 @@
 from tenacity import wait_exponential

 from scrapd.core.constant import Fields
-from scrapd.core import date_utils
+from scrapd.core.date_utils import check_dob, clean_date_string, compute_age
+from scrapd.core.date_utils import parse_date, from_date, to_date
Member

@rgreinho rgreinho May 22, 2019

Don't organize imports like that. It is clearer for the reader to know which module the functions come from.

Some more explanations: https://docs.openstack.org/charm-guide/latest/coding-guidelines.html#import-style

    :return: A dictionary containing the details information about the fatality with sanitized entries.
    :rtype: dict
    """

    # All values must be strings.
    for k, v in d.items():
        if isinstance(v, list):
            d[k] = ' '.join(v)

Member

@rgreinho rgreinho May 22, 2019

This part should go to the sanitize_fatality_entity() function. The goal of the sanitizing function was to ensure that the final values had the right format, or that unnecessary or invalid values (like None or empty ones) were discarded.
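A rough sketch of how the two responsibilities described above could live together in sanitize_fatality_entity() (the flattening and filtering rules here are assumptions drawn from this thread, not the real implementation):

```python
def sanitize_fatality_entity(d):
    # Flatten list values into space-separated strings, then drop
    # entries that are None or empty (illustrative rules only).
    for k, v in d.items():
        if isinstance(v, list):
            d[k] = ' '.join(v)
    return {k: v for k, v in d.items() if v not in (None, '')}

print(sanitize_fatality_entity({'First Name': ['John'], 'DOB': None, 'Notes': ''}))
# {'First Name': 'John'}
```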

Contributor Author

@mscarey mscarey May 23, 2019

Currently that part needs to happen before the deceased field is parsed, while sanitize_fatality_entity() has to happen afterwards.

            Fields.GENDER: "male",
            Fields.AGE: 19,
        },
    )))
def test_parse_deceased_field_00(deceased, expected):
Member

@rgreinho rgreinho May 22, 2019

The formatting for your test cases makes it harder to read the diff.

I am not sure why the formatter organized your code like that, but I think it is because your last test case entry does not end with a trailing comma. Try replacing line 306 ))) with ),)) and run the formatter again.

    f = CSVFormatter(output=stdout)
    f.printer(RESULTS)
    out, _ = capsys.readouterr()
    assert "2005-12-05" in out
Member

@rgreinho rgreinho May 22, 2019

This test should fail as the date does not have the right format. We expect the date to be MM/DD/YYYY.
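Keeping datetime objects internally does not prevent the formatters from emitting the expected layout at output time; a minimal sketch of the two layouts in question:

```python
import datetime

dob = datetime.datetime(2005, 12, 5)
print(dob.strftime("%m/%d/%Y"))  # 12/05/2005 (the expected output format)
print(dob.strftime("%Y-%m-%d"))  # 2005-12-05 (what the test asserted)
```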

@@ -415,7 +455,7 @@ def parse_deceased_field_common(split_deceased_field, fleg):

     # Extract and clean up DOB.
     raw_dob = split_deceased_field[-1].strip()
-    d[Fields.DOB] = date_utils.clean_date_string(raw_dob, True)
+    d[Fields.DOB] = clean_date_string(raw_dob, True)
Member

@rgreinho rgreinho May 22, 2019

Is this still useful? As you use Python objects for manipulating the dates, I believe you could use the parse_date() function directly.

You may even be able to get rid of the clean_date_string() function now.
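A sketch of what folding the cleanup into a single parse step could look like. This parse_date is a stdlib-only stand-in, not the project's function — the real one can lean on dateparser to handle many more layouts:

```python
import datetime

def parse_date(raw):
    # Stand-in for the project's parse_date(): collapse stray
    # whitespace, then try a few known formats, removing the need
    # for a separate clean_date_string() step.
    cleaned = ' '.join(raw.split())
    for fmt in ("%m/%d/%Y", "%m-%d-%Y", "%B %d, %Y"):
        try:
            return datetime.datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None

print(parse_date("  August 30,   1966 "))  # 1966-08-30 00:00:00
```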

Contributor Author

@mscarey mscarey May 23, 2019

I made the change, but it did result in separate DOB parsing commands for the "page" and "twitter" processes.

     def printer(self, results, **kwargs):  # noqa: D102
-        print(json.dumps(results, sort_keys=True, indent=2), file=self.output)
+        json_string = self.to_json_string(results, **kwargs)
+        print(json_string, file=self.output)
Member

@rgreinho rgreinho May 22, 2019

You don't need to create an extra function here. to_json_string() is not really necessary, you can simply put:

print(json.dumps(results, sort_keys=True, indent=2, default=self.date_serialize), file=self.output)

Contributor Author

@mscarey mscarey May 23, 2019

I wanted that line to live in a method with a return value, for easier testing. I made more changes for the CSV formatter as well.
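The design point here is that a method returning a string can be asserted on directly, without capturing stdout. A sketch under that assumption (the class and method names mirror the discussion, but the bodies are illustrative):

```python
import datetime
import json

class JSONFormatter:
    def __init__(self, output=None):
        self.output = output

    def date_serialize(self, obj):
        # Assumed helper matching the suggested default= hook.
        if isinstance(obj, (datetime.date, datetime.datetime)):
            return obj.strftime("%m/%d/%Y")
        raise TypeError(f"{type(obj)!r} is not JSON serializable")

    def to_json_string(self, results, **kwargs):
        # Returning the serialized string makes this directly testable.
        return json.dumps(results, sort_keys=True, indent=2,
                          default=self.date_serialize)

    def printer(self, results, **kwargs):
        print(self.to_json_string(results, **kwargs), file=self.output)

f = JSONFormatter()
print(f.to_json_string({"DOB": datetime.datetime(1966, 8, 30)}))
```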

@rgreinho
Member

@rgreinho rgreinho commented May 23, 2019

All the changes look good to me! Great job @mscarey! 👍

Once you update your branch, the code should be merged automatically.

@mergify mergify bot merged commit 5a333e0 into scrapd:master May 23, 2019
8 checks passed
@mscarey mscarey mentioned this pull request Jul 15, 2019
1 task