Skip to content
This repository has been archived by the owner on Feb 2, 2022. It is now read-only.

Update date parsing and extract to its own function #132

Merged
merged 2 commits into from
May 24, 2019

Conversation

anthonybaulo
Copy link
Member

Types of changes

  • New feature (non-breaking change which adds functionality)
  • Code cleanup / Refactoring

Description

Extracts date parser to its own function with associated tests.
Adds robust date handling with dateparser library, which accepts many date formats from APD, including extraneous descriptors, and returns the format: 'Month Day, Year'.

Checklist:

  • I have updated the documentation accordingly
  • I have written unit tests

Fixes: #105

@anthonybaulo
Copy link
Member Author

@rgreinho , I can't recreate that failure locally. And I'm not exactly sure what the failure is referring to.

However, since there's a peculiarity with my setup, I run make ci and then pytest -s -x -vvv -n0 -m "not integrations", following your instructions from Flightpath. The final message says 99 passed, 3 deselected, 1 warnings. Perhaps that has something to do with why I'm not getting the same results on my end.

Copy link
Member

@rgreinho rgreinho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job with your second PR! Nailing these issues down!

One small detail to fix regarding the case where no match is found, but once it is fixed, it will be good to go! 👍

)
date = match_pattern(page, date_pattern)
date = search_dates(date)
return date[0][1].strftime("%B %d, %Y") if date else ''
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The format should be strftime(dt, "%m/%d/%Y") to match the existing one.

def test_parse_date_field_00(input_, expected):
"""Ensure a date field gets parsed correctly."""
actual = apd.parse_date_field(input_)
assert actual == expected
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good testing, however there is no invalid case. You should add at least one case which fails to be parsed. this would help you test the case I described in my remark about search_dates().

Add a new line at the end of file.

re.VERBOSE,
)
date = match_pattern(page, date_pattern)
date = search_dates(date)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If search_dates() does not find any date it will return None, so you should account for that too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was accounting for that on the return line (L562)
return date[0][1].strftime("%B %d, %Y") if date else ''
Would you like the return to be None instead of '' if date is None?

Regarding the format, I didn't realize there was an existing format, as the regex seemed to be matching/returning whatever APD entered in the date field, regardless of format.
For future reference, where would I have looked to find the existing format?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, you're right, I missed the catch on line 562. So you can disregard my comment.

Regarding the format, the regex was returning the raw string, and it was cleaned up later in the code. The clean up function is here: https://github.com/scrapd/scrapd/blob/master/scrapd/core/date_utils.py#L35-L47

@rgreinho
Copy link
Member

rgreinho commented May 23, 2019

Some times the integration tests can be finicky. I tried from my laptop and I also got it to pass. I'll restart the tests a bit later during the day.

@rgreinho
Copy link
Member

Actually it did not work from my laptop either. BefforeI tried from the master branch instead of yours. Sorry the false hope.

For some reason, the fatality with the case number "Case": "18-2760038" "Link": "http://austintexas.gov/news/traffic-fatality-53-5"), is not retrieved with your changes.

@rgreinho
Copy link
Member

To repro the problem:

[remygreinhofer:~/remy-project … apd-org/scrapd] [venv] master 25s ± git checkout -
Switched to branch 'pr/132'
[remygreinhofer:~/remy-project … apd-org/scrapd] [venv] pr/132 ± scrapd -v --from 'Oct 3 2018' --to 'Oct 3 2018'
Fetching page 1...
Fetching page 2...
Fetching page 3...
Fetching page 4...
Fetching page 5...
Fetching page 6...
Fetching page 7...
Fetching page 8...
Fetching page 9...
Fetching page 10...
Total: 0
[]
[remygreinhofer:~/remy-project … apd-org/scrapd] [venv] pr/132 33s ± git checkout -
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
[remygreinhofer:~/remy-project … apd-org/scrapd] [venv] master ± scrapd -v --from 'Oct 3 2018' --to 'Oct 3 2018'
Fetching page 1...
Fetching page 2...
Fetching page 3...
Fetching page 4...
Fetching page 5...
Fetching page 6...
Fetching page 7...
Fetching page 8...
Fetching page 9...
Total: 1
[
  {
    "Age": 40,
    "Case": "18-2760038",
    "DOB": "05/21/1978",
    "Date": "10/03/2018",
    "Ethnicity": "Black",
    "Fatal crashes this year": "53",
    "First Name": "Michael",
    "Gender": "male",
    "Last Name": "Green",
    "Link": "http://austintexas.gov/news/traffic-fatality-53-5",
    "Location": "2500 N IH-35 Southbound",
    "Notes": "The preliminary investigation shows that a 2009 Freightliner truck and attached trailer was stopped in traffic in the 2500 block of N IH-35, in the left lane, when a 2001 Toyota Avalon struck the back side of the trailer, at a high rate of speed. The driver of the Toyota Avalon was pronounced deceased at the scene. APD is investigating this case. Anyone with information regarding this case is asked to call the APD Vehicular Homicide Unit Detectives at (512) 974-5576. You can also submit tips by downloading APD\u2019s mobile app, Austin PD, for free on iPhone and Android. This is Austin\u2019s fifty-third fatal traffic crash of 2018, resulting in fifty-four fatalities this year. At this time in 2017, there were forty-nine fatal traffic crashes and fifty-one traffic fatalities. These statements are based on the initial assessment of the fatal crash and investigation is still pending. Fatality information may change.",
    "Time": "12:27 a.m."
  }
]
[remygreinhofer:~/remy-project … apd-org/scrapd] [venv] master 29s ±

@anthonybaulo
Copy link
Member Author

The scrapd command doesn't seem to work when I'm in my branch

[magoo:~/Documents … scrapd-org/scrapd] [venv] issues/105/date-field(+7/-4) ± scrapd -v --from 'Oct 3 2018' --to 'Oct 3 2018'
-bash: scrapd: command not found

Did you see my questions regarding the format and returns?
#132 (comment)

@rgreinho
Copy link
Member

Because you did not create the venv with the makefile, you have to install scrapd manually in it. Just run pip install -e ..

@rgreinho
Copy link
Member

@anthonybaulo as #125 just got merge your PR will be affected. The main change you will face is that the dates are now stored as a date object instead of strings.

Extracts date parser to its own function with associated tests, and adds robust date handling with `dateparser` library.

Fixes scrapd#105
Updates the date format to ##/##/####.
Provides work-around for instances where search_dates() returns the wrong date when there is a period present in the date (e.g. Oct. 3, 2018 returns 10/23/2019). This has been remedied by replacing all periods with a space before processing with search_dates().
Adds tests for  invalid cases.

Fixes scrapd#105
Copy link
Member

@rgreinho rgreinho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

W00t! Great job @anthonybaulo!

@mergify mergify bot merged commit ec395c3 into scrapd:master May 24, 2019
@anthonybaulo anthonybaulo deleted the issues/105/date-field branch June 6, 2019 00:51
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extract the date parsing into its own function
2 participants