Update date parsing and extract to its own function #132
@rgreinho , I can't recreate that failure locally. And I'm not exactly sure what the failure is referring to. However, since there's a peculiarity with my setup, I run |
Great job with your second PR! Nailing these issues down!
One small detail to fix regarding the case where no match is found, but once it is fixed, it will be good to go! 👍
scrapd/core/apd.py
Outdated
)
date = match_pattern(page, date_pattern)
date = search_dates(date)
return date[0][1].strftime("%B %d, %Y") if date else ''
The format should be strftime(dt, "%m/%d/%Y") to match the existing one.
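For reference, the difference between the two format strings can be checked with the standard library alone. This is a minimal sketch; the variable names are illustrative and are not taken from scrapd.

```python
from datetime import datetime

# A sample date matching the one used later in this thread.
dt = datetime(2018, 10, 3)

# Format used in the first revision of the PR ('Month Day, Year').
long_form = dt.strftime("%B %d, %Y")      # 'October 03, 2018'

# Numeric format requested in the review.
numeric_form = dt.strftime("%m/%d/%Y")    # '10/03/2018'

# Calling strftime through the class with the instance as first argument
# is equivalent to calling the method on the instance.
unbound_form = datetime.strftime(dt, "%m/%d/%Y")
```

Note that %d zero-pads the day in both formats, and %B depends on the active locale (English in the default C locale).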
tests/core/test_apd.py
Outdated
def test_parse_date_field_00(input_, expected):
    """Ensure a date field gets parsed correctly."""
    actual = apd.parse_date_field(input_)
    assert actual == expected
Good testing; however, there is no invalid case. You should add at least one case that fails to parse. This would help you test the case I described in my remark about search_dates().
Add a newline at the end of the file.
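A table of cases that mixes valid and invalid inputs might look like the sketch below. The parser here is a hypothetical standard-library stand-in (parse_date_field_sketch and FORMATS are assumptions, not the PR's apd.parse_date_field), used only so the cases can run on their own.

```python
from datetime import datetime

# Hypothetical list of accepted input formats; the real function relies on dateparser.
FORMATS = ("%m/%d/%Y", "%B %d, %Y", "%b %d, %Y")


def parse_date_field_sketch(text):
    """Return the date as MM/DD/YYYY, or '' when nothing can be parsed."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    return ''


# (input, expected) pairs, including inputs that must fail to parse.
CASES = [
    ("10/03/2018", "10/03/2018"),
    ("October 3, 2018", "10/03/2018"),
    ("Oct 3, 2018", "10/03/2018"),
    ("", ""),
    ("no date here", ""),
]


def run_cases():
    """Return (input, actual, expected) for every case."""
    return [(i, parse_date_field_sketch(i), e) for i, e in CASES]
```

With pytest, the same pairs could be passed to @pytest.mark.parametrize so that each invalid input shows up as its own test.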
re.VERBOSE,
)
date = match_pattern(page, date_pattern)
date = search_dates(date)
If search_dates() does not find any date it will return None, so you should account for that too.
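One way to guard against the None case is a conditional expression on the return line. The sketch below uses a hypothetical stub (stub_search_dates) in place of dateparser's search_dates, mimicking only its return shape: a list of (matched_text, datetime) tuples, or None when nothing is found.

```python
from datetime import datetime
from typing import List, Optional, Tuple


def stub_search_dates(text: str) -> Optional[List[Tuple[str, datetime]]]:
    """Hypothetical stand-in for search_dates(): a list of tuples, or None."""
    try:
        return [(text, datetime.strptime(text, "%B %d, %Y"))]
    except ValueError:
        return None


def format_first_date(text: str) -> str:
    """Format the first date found, or return '' when there is none."""
    found = stub_search_dates(text)
    # The conditional expression evaluates found[0][1] only when found is
    # truthy, so a None result never gets indexed.
    return found[0][1].strftime("%m/%d/%Y") if found else ''
```

For example, format_first_date("October 3, 2018") gives '10/03/2018', while format_first_date("not a date") gives ''.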
I was accounting for that on the return line (L562)
return date[0][1].strftime("%B %d, %Y") if date else ''
Would you like the return to be None instead of '' if date is None?
Regarding the format, I didn't realize there was an existing format, as the regex seemed to be matching/returning whatever APD entered in the date field, regardless of format.
For future reference, where would I have looked to find the existing format?
Ah, yes, you're right, I missed the catch on line 562. So you can disregard my comment.
Regarding the format, the regex was returning the raw string, and it was cleaned up later in the code. The cleanup function is here: https://github.com/scrapd/scrapd/blob/master/scrapd/core/date_utils.py#L35-L47
Sometimes the integration tests can be finicky. I tried from my laptop and I also got it to pass. I'll restart the tests a bit later today.
Actually, it did not work from my laptop either. Before, I tried from the master branch instead of yours. Sorry for the false hope. For some reason, the fatality with the case number
To repro the problem:
[remygreinhofer:~/remy-project … apd-org/scrapd] [venv] master 25s ± git checkout -
Switched to branch 'pr/132'
[remygreinhofer:~/remy-project … apd-org/scrapd] [venv] pr/132 ± scrapd -v --from 'Oct 3 2018' --to 'Oct 3 2018'
Fetching page 1...
Fetching page 2...
Fetching page 3...
Fetching page 4...
Fetching page 5...
Fetching page 6...
Fetching page 7...
Fetching page 8...
Fetching page 9...
Fetching page 10...
Total: 0
[]
[remygreinhofer:~/remy-project … apd-org/scrapd] [venv] pr/132 33s ± git checkout -
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
[remygreinhofer:~/remy-project … apd-org/scrapd] [venv] master ± scrapd -v --from 'Oct 3 2018' --to 'Oct 3 2018'
Fetching page 1...
Fetching page 2...
Fetching page 3...
Fetching page 4...
Fetching page 5...
Fetching page 6...
Fetching page 7...
Fetching page 8...
Fetching page 9...
Total: 1
[
{
"Age": 40,
"Case": "18-2760038",
"DOB": "05/21/1978",
"Date": "10/03/2018",
"Ethnicity": "Black",
"Fatal crashes this year": "53",
"First Name": "Michael",
"Gender": "male",
"Last Name": "Green",
"Link": "http://austintexas.gov/news/traffic-fatality-53-5",
"Location": "2500 N IH-35 Southbound",
"Notes": "The preliminary investigation shows that a 2009 Freightliner truck and attached trailer was stopped in traffic in the 2500 block of N IH-35, in the left lane, when a 2001 Toyota Avalon struck the back side of the trailer, at a high rate of speed. The driver of the Toyota Avalon was pronounced deceased at the scene. APD is investigating this case. Anyone with information regarding this case is asked to call the APD Vehicular Homicide Unit Detectives at (512) 974-5576. You can also submit tips by downloading APD\u2019s mobile app, Austin PD, for free on iPhone and Android. This is Austin\u2019s fifty-third fatal traffic crash of 2018, resulting in fifty-four fatalities this year. At this time in 2017, there were forty-nine fatal traffic crashes and fifty-one traffic fatalities. These statements are based on the initial assessment of the fatal crash and investigation is still pending. Fatality information may change.",
"Time": "12:27 a.m."
}
]
[remygreinhofer:~/remy-project … apd-org/scrapd] [venv] master 29s ± |
The
Did you see my questions regarding the format and returns? |
Because you did not create the venv with the makefile, you have to install scrapd manually in it. Just run |
@anthonybaulo, as #125 just got merged, your PR will be affected. The main change you will face is that the dates are now stored as date objects instead of strings.
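As a rough illustration of what that change implies (a sketch only; how #125 actually stores and serializes dates is not shown in this thread), a date object compares chronologically and is formatted only when it is written out:

```python
from datetime import date, datetime

# Parse once into a date object instead of carrying a formatted string around.
parsed = datetime.strptime("10/03/2018", "%m/%d/%Y").date()

# Date objects compare chronologically.
earlier = date(2018, 9, 30)
dates_ordered = earlier < parsed

# Strings compare character by character, which can invert the order:
# '1' sorts before '9', so the October string sorts before the September one.
strings_ordered = "9/30/2018" < "10/03/2018"

# Formatting happens only at output time.
rendered = parsed.strftime("%m/%d/%Y")
```

Here dates_ordered is True while strings_ordered is False, which is the kind of bug that storing date objects avoids.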
Extracts date parser to its own function with associated tests, and adds robust date handling with `dateparser` library. Fixes scrapd#105
Updates the date format to ##/##/####. Provides work-around for instances where search_dates() returns the wrong date when there is a period present in the date (e.g. Oct. 3, 2018 returns 10/23/2019). This has been remedied by replacing all periods with a space before processing with search_dates(). Adds tests for invalid cases. Fixes scrapd#105
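The period workaround described in this commit message can be sketched as follows. The helper names are hypothetical, and strptime stands in for search_dates() so the snippet runs with the standard library only; it does not reproduce the dateparser behavior reported above.

```python
from datetime import datetime


def normalize_date_text(text: str) -> str:
    """Replace periods with spaces, then collapse runs of whitespace."""
    return " ".join(text.replace(".", " ").split())


def parse_abbreviated(text: str) -> str:
    """Parse an abbreviated-month date such as 'Oct. 3, 2018' to MM/DD/YYYY."""
    cleaned = normalize_date_text(text)  # 'Oct. 3, 2018' -> 'Oct 3, 2018'
    return datetime.strptime(cleaned, "%b %d, %Y").strftime("%m/%d/%Y")
```

Replacing every period is a blunt fix: it also alters strings such as "a.m.", so it is best applied only to the text of the date field itself.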
4414af8 to e2c94fd
W00t! Great job @anthonybaulo!
Types of changes
Description
Extracts date parser to its own function with associated tests.
Adds robust date handling with the `dateparser` library, which accepts many date formats from APD, including extraneous descriptors, and returns the format: 'Month Day, Year'.
Checklist:
Fixes: #105