Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update date parsing and extract to its own function #132

Merged
merged 2 commits into from May 24, 2019

Conversation

Projects
None yet
2 participants
@anthonybaulo
Copy link
Contributor

commented May 23, 2019

Types of changes

  • New feature (non-breaking change which adds functionality)
  • Code cleanup / Refactoring

Description

Extracts date parser to its own function with associated tests.
Adds robust date handling with dateparser library, which accepts many date formats from APD, including extraneous descriptors, and returns the format: 'Month Day, Year'.

Checklist:

  • I have updated the documentation accordingly
  • I have written unit tests

Fixes: #105

@anthonybaulo

This comment has been minimized.

Copy link
Contributor Author

commented May 23, 2019

@rgreinho , I can't recreate that failure locally. And I'm not exactly sure what the failure is referring to.

However, since there's a peculiarity with my setup, I run make ci and then pytest -s -x -vvv -n0 -m "not integrations", following your instructions from Flightpath. The final message says 99 passed, 3 deselected, 1 warnings. Perhaps that has something to do with why I'm not getting the same results on my end.

@rgreinho
Copy link
Member

left a comment

Great job with your second PR! Nailing these issues down!

One small detail to fix regarding the case where no match is found, but once it is fixed, it will be good to go! 👍

)
date = match_pattern(page, date_pattern)
date = search_dates(date)
return date[0][1].strftime("%B %d, %Y") if date else ''

This comment has been minimized.

Copy link
@rgreinho

rgreinho May 23, 2019

Member

The format should be strftime(dt, "%m/%d/%Y") to match the existing one.

def test_parse_date_field_00(input_, expected):
"""Ensure a date field gets parsed correctly."""
actual = apd.parse_date_field(input_)
assert actual == expected

This comment has been minimized.

Copy link
@rgreinho

rgreinho May 23, 2019

Member

Good testing, however there is no invalid case. You should add at least one case which fails to be parsed. this would help you test the case I described in my remark about search_dates().

Add a new line at the end of file.

re.VERBOSE,
)
date = match_pattern(page, date_pattern)
date = search_dates(date)

This comment has been minimized.

Copy link
@rgreinho

rgreinho May 23, 2019

Member

If search_dates() does not find any date it will return None, so you should account for that too.

This comment has been minimized.

Copy link
@anthonybaulo

anthonybaulo May 23, 2019

Author Contributor

I was accounting for that on the return line (L562)
return date[0][1].strftime("%B %d, %Y") if date else ''
Would you like the return to be None instead of '' if date is None?

Regarding the format, I didn't realize there was an existing format, as the regex seemed to be matching/returning whatever APD entered in the date field, regardless of format.
For future reference, where would I have looked to find the existing format?

This comment has been minimized.

Copy link
@rgreinho

rgreinho May 23, 2019

Member

Ah, yes, you're right, I missed the catch on line 562. So you can disregard my comment.

Regarding the format, the regex was returning the raw string, and it was cleaned up later in the code. The clean up function is here: https://github.com/scrapd/scrapd/blob/master/scrapd/core/date_utils.py#L35-L47

@rgreinho

This comment has been minimized.

Copy link
Member

commented May 23, 2019

Some times the integration tests can be finicky. I tried from my laptop and I also got it to pass. I'll restart the tests a bit later during the day.

@rgreinho

This comment has been minimized.

Copy link
Member

commented May 23, 2019

Actually it did not work from my laptop either. BefforeI tried from the master branch instead of yours. Sorry the false hope.

For some reason, the fatality with the case number "Case": "18-2760038" "Link": "http://austintexas.gov/news/traffic-fatality-53-5"), is not retrieved with your changes.

@rgreinho

This comment has been minimized.

Copy link
Member

commented May 23, 2019

To repro the problem:

[remygreinhofer:~/remy-project … apd-org/scrapd] [venv] master 25s ± git checkout -
Switched to branch 'pr/132'
[remygreinhofer:~/remy-project … apd-org/scrapd] [venv] pr/132 ± scrapd -v --from 'Oct 3 2018' --to 'Oct 3 2018'
Fetching page 1...
Fetching page 2...
Fetching page 3...
Fetching page 4...
Fetching page 5...
Fetching page 6...
Fetching page 7...
Fetching page 8...
Fetching page 9...
Fetching page 10...
Total: 0
[]
[remygreinhofer:~/remy-project … apd-org/scrapd] [venv] pr/132 33s ± git checkout -
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
[remygreinhofer:~/remy-project … apd-org/scrapd] [venv] master ± scrapd -v --from 'Oct 3 2018' --to 'Oct 3 2018'
Fetching page 1...
Fetching page 2...
Fetching page 3...
Fetching page 4...
Fetching page 5...
Fetching page 6...
Fetching page 7...
Fetching page 8...
Fetching page 9...
Total: 1
[
  {
    "Age": 40,
    "Case": "18-2760038",
    "DOB": "05/21/1978",
    "Date": "10/03/2018",
    "Ethnicity": "Black",
    "Fatal crashes this year": "53",
    "First Name": "Michael",
    "Gender": "male",
    "Last Name": "Green",
    "Link": "http://austintexas.gov/news/traffic-fatality-53-5",
    "Location": "2500 N IH-35 Southbound",
    "Notes": "The preliminary investigation shows that a 2009 Freightliner truck and attached trailer was stopped in traffic in the 2500 block of N IH-35, in the left lane, when a 2001 Toyota Avalon struck the back side of the trailer, at a high rate of speed. The driver of the Toyota Avalon was pronounced deceased at the scene. APD is investigating this case. Anyone with information regarding this case is asked to call the APD Vehicular Homicide Unit Detectives at (512) 974-5576. You can also submit tips by downloading APD\u2019s mobile app, Austin PD, for free on iPhone and Android. This is Austin\u2019s fifty-third fatal traffic crash of 2018, resulting in fifty-four fatalities this year. At this time in 2017, there were forty-nine fatal traffic crashes and fifty-one traffic fatalities. These statements are based on the initial assessment of the fatal crash and investigation is still pending. Fatality information may change.",
    "Time": "12:27 a.m."
  }
]
[remygreinhofer:~/remy-project … apd-org/scrapd] [venv] master 29s ±
@anthonybaulo

This comment has been minimized.

Copy link
Contributor Author

commented May 23, 2019

The scrapd command doesn't seem to work when I'm in my branch

[magoo:~/Documents … scrapd-org/scrapd] [venv] issues/105/date-field(+7/-4) ± scrapd -v --from 'Oct 3 2018' --to 'Oct 3 2018'
-bash: scrapd: command not found

Did you see my questions regarding the format and returns?
#132 (comment)

@rgreinho

This comment has been minimized.

Copy link
Member

commented May 23, 2019

Because you did not create the venv with the makefile, you have to install scrapd manually in it. Just run pip install -e ..

@rgreinho

This comment has been minimized.

Copy link
Member

commented May 23, 2019

@anthonybaulo as #125 just got merge your PR will be affected. The main change you will face is that the dates are now stored as a date object instead of strings.

anthonybaulo added some commits May 23, 2019

Extract date parsing to its own function
Extracts date parser to its own function with associated tests, and adds robust date handling with `dateparser` library.

Fixes #105
Add requested changes to PR
Updates the date format to ##/##/####.
Provides work-around for instances where search_dates() returns the wrong date when there is a period present in the date (e.g. Oct. 3, 2018 returns 10/23/2019). This has been remedied by replacing all periods with a space before processing with search_dates().
Adds tests for  invalid cases.

Fixes #105

@anthonybaulo anthonybaulo force-pushed the anthonybaulo:issues/105/date-field branch from 4414af8 to e2c94fd May 24, 2019

@rgreinho
Copy link
Member

left a comment

W00t! Great job @anthonybaulo!

@mergify mergify bot merged commit ec395c3 into scrapd:master May 24, 2019

8 checks passed

Summary 1 rule matches
Details
ci/circleci: docs Your tests passed on CircleCI!
Details
ci/circleci: format Your tests passed on CircleCI!
Details
ci/circleci: lint Your tests passed on CircleCI!
Details
ci/circleci: prepare Your tests passed on CircleCI!
Details
ci/circleci: test-integrations Your tests passed on CircleCI!
Details
ci/circleci: test-units Your tests passed on CircleCI!
Details
coverage/coveralls Coverage remained the same at 100.0%
Details

@anthonybaulo anthonybaulo deleted the anthonybaulo:issues/105/date-field branch Jun 6, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.