Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean notes field #64

Merged
merged 8 commits into from Mar 15, 2019

Conversation

Projects
None yet
2 participants
@mrengler
Copy link
Contributor

commented Mar 14, 2019

Types of changes

  • Feature enhancement (non-breaking change which improves functionality)

Description

Action items were to:
-Find a way to consistently remove the footer information
-Clean the Notes from all HTML tags and \t
-Remove the victims First, Last, Ethnicity, Gender details (i.e. the first line in the example)

Footer information is removed based on a signal phrase which appears in many descriptions. Also, the HTML tags are removed, which made up the majority of the excess footer characters anyway. Excess whitespace is removed and segments of the victims are (mostly) removed from the Notes by looking for ___ male or _____ female. There are a few cases in which they are not removed, if there are no additional Notes following that segment (the parser has not managed to capture any notes).

This change helps the Notes fields returned to be cleaner and easier to read.

I tested the changes by printing the results of all the parsed Notes fields to a file.

Checklist:

The changes to the parse_details_page_notes methods are all documented in code and the test coverage remains the same.

  • I have updated the documentation accordingly
  • I have written unit tests

#63

Fixes: #63

mrengler added some commits Mar 8, 2019

Merge pull request #1 from scrapd/master
Sync with original repo
@rgreinho
Copy link
Member

left a comment

This produces very good results!

I just had some minor remarks to simplify the patch.

However we do need better tests to ensure there is no regression in the future. test_parse_details_page_notes_01 does not really test anything. You should write the tests using scenarios as it will be the same test for different inputs. I could help you with that part if you are not too familiar with it.

# Demographic info is usually only included in subject description.
# DOB would be better, but that is sometimes missing.
for segment in remove_subjects:
if 'male' in segment or 'female' in segment:

This comment has been minimized.

Copy link
@rgreinho

rgreinho Mar 14, 2019

Member

The in keyword does partial match, so looking for male is enough to cover both cases.

if 'male' in segment or 'female' in segment:
remove_subjects.remove(segment)

final = ' '.join(remove_subjects)

This comment has been minimized.

Copy link
@rgreinho

rgreinho Mar 14, 2019

Member

You can even replace it with a nice one liner:

final = ' '.join([segment for segment in remove_subjects if 'male' not in segment])
no_html = re.sub(re.compile('<.*?>'), '', squished[first_cap:])

# Remove tabs and, if subjects included, remove.
remove_subjects = re.split(r'\s{2,}', no_html)

This comment has been minimized.

Copy link
@rgreinho

rgreinho Mar 14, 2019

Member

It seems fragile to rely on 2 spaces or more. Not sure what would be a better solution without using the HTML tags though 🤔

This comment has been minimized.

Copy link
@mrengler

mrengler Mar 14, 2019

Author Contributor

Yes, I was having a hard time because the split on tabs didn't do much for many of the Notes, because the whitespace was a series of spaces instead 🙄

footer_string = 'Fatality information may change.'

if final.find(footer_string) != -1:
final = final[:final.find(footer_string) + len(footer_string)]

This comment has been minimized.

Copy link
@rgreinho

rgreinho Mar 14, 2019

Member

I would just try not to run find twice in a row:

end_pos = final.find(footer_string)
if end_pos != -1:
        final = final[:end_pos + len(footer_string)]
@mrengler

This comment has been minimized.

Copy link
Contributor Author

commented Mar 14, 2019

This produces very good results!

I just had some minor remarks to simplify the patch.

However we do need better tests to ensure there is no regression in the future. test_parse_details_page_notes_01 does not really test anything. You should write the tests using scenarios as it will be the same test for different inputs. I could help you with that part if you are not too familiar with it.

Thanks for the comments! Definitely will implement the one-liners ASAP. I think I should also be able to write tests using the previous ones as examples—I didn't do it earlier because I thought that the parsing function would change too much, but now that it's in a relatively stable state I can do that. I'll let you know if I have any questions with it!

@rgreinho
Copy link
Member

left a comment

Looks very good! Simply use scenarios with a single test and we are good to merge!

'there were 71 fatal traffic crashes and 76 traffic fatalities. These statements are based on '
'the initial assessment of the fatal crash and investigation is still pending. Fatality information may change.'
)
if actual != expected:

This comment has been minimized.

Copy link
@rgreinho

rgreinho Mar 14, 2019

Member

This is not needed. Pytest will display a diff for you if you run it in verbose mode:

pytest -x -s -vvv -m 'not integrations' -n0 tests/core/test_apd.py::test_parse_details_page_notes_02

Also after removing this the 2 tests will be identical and you could use scenarios with only 1 test.

This comment has been minimized.

Copy link
@mrengler

mrengler Mar 14, 2019

Author Contributor

Oops, I totally forgot I left that in there lol. Removed!

This is Austin’s 73rd fatal traffic crash of 2018, resulting in 74 fatalities this year. At this time in 2017, there were 71 fatal traffic crashes and 76 traffic fatalities.<br />
<br /><strong><em>These statements are based on the initial assessment of the fatal crash and investigation is still pending. Fatality information may change.</em></strong></p>
"""

This comment has been minimized.

Copy link
@rgreinho

rgreinho Mar 14, 2019

Member

Add a new line at the end of a file.

@rgreinho

This comment has been minimized.

Copy link
Member

commented Mar 14, 2019

And to finish, mention your work in the CHANGELOG.

@mergify mergify bot merged commit 049e9a4 into scrapd:master Mar 15, 2019

7 checks passed

Mergify — Summary 1 rule matches
Details
ci/circleci: docs Your tests passed on CircleCI!
Details
ci/circleci: format Your tests passed on CircleCI!
Details
ci/circleci: lint Your tests passed on CircleCI!
Details
ci/circleci: prepare Your tests passed on CircleCI!
Details
ci/circleci: test Your tests passed on CircleCI!
Details
coverage/coveralls Coverage increased (+0.07%) to 97.122%
Details

@rgreinho rgreinho referenced this pull request Mar 15, 2019

Closed

Sanitize the "Notes" field #63

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.