Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean notes field #64

Merged
merged 8 commits into from Mar 15, 2019
Merged

Clean notes field #64

merged 8 commits into from Mar 15, 2019

Conversation

@mrengler
Copy link
Contributor

@mrengler mrengler commented Mar 14, 2019

Types of changes

  • Feature enhancement (non-breaking change which improves functionality)

Description

Action items were to:
-Find a way to consistently remove the footer information
-Clean the Notes from all HTML tags and \t
-Remove the victims First, Last, Ethnicity, Gender details (i.e. the first line in the example)

Footer information is removed based on a signal phrase which appears in many descriptions. Also, the HTML tags are removed, which made up the majority of the excess footer characters anyway. Excess whitespace is removed and segments of the victims are (mostly) removed from the Notes by looking for ___ male or _____ female. There are a few cases in which they are not removed, if there are no additional Notes following that segment (the parser has not managed to capture any notes).

This change helps the Notes fields returned to be cleaner and easier to read.

I tested the changes by printing the results of all the parsed Notes fields to a file.

Checklist:

The changes to the parse_details_page_notes methods are all documented in code and the test coverage remains the same.

  • I have updated the documentation accordingly
  • I have written unit tests

#63

Fixes: #63

Copy link
Member

@rgreinho rgreinho left a comment

This produces very good results!

I just had some minor remarks to simplify the patch.

However we do need better tests to ensure there is no regression in the future. test_parse_details_page_notes_01 does not really test anything. You should write the tests using scenarios as it will be the same test for different inputs. I could help you with that part if you are not too familiar with it.

# Demographic info is usually only included in subject description.
# DOB would be better, but that is sometimes missing.
for segment in remove_subjects:
if 'male' in segment or 'female' in segment:

This comment has been minimized.

@rgreinho

rgreinho Mar 14, 2019
Member

The in keyword does partial match, so looking for male is enough to cover both cases.

if 'male' in segment or 'female' in segment:
remove_subjects.remove(segment)

final = ' '.join(remove_subjects)

This comment has been minimized.

@rgreinho

rgreinho Mar 14, 2019
Member

You can even replace it with a nice one liner:

final = ' '.join([segment for segment in remove_subjects if 'male' not in segment])
no_html = re.sub(re.compile('<.*?>'), '', squished[first_cap:])

# Remove tabs and, if subjects included, remove.
remove_subjects = re.split(r'\s{2,}', no_html)

This comment has been minimized.

@rgreinho

rgreinho Mar 14, 2019
Member

It seems fragile to rely on 2 spaces or more. Not sure what would be a better solution without using the HTML tags though 🤔

This comment has been minimized.

@mrengler

mrengler Mar 14, 2019
Author Contributor

Yes, I was having a hard time because the split on tabs didn't do much for many of the Notes, because the whitespace was a series of spaces instead 🙄

footer_string = 'Fatality information may change.'

if final.find(footer_string) != -1:
final = final[:final.find(footer_string) + len(footer_string)]

This comment has been minimized.

@rgreinho

rgreinho Mar 14, 2019
Member

I would just try not to run find twice in a row:

end_pos = final.find(footer_string)
if end_pos != -1:
        final = final[:end_pos + len(footer_string)]
@mrengler
Copy link
Contributor Author

@mrengler mrengler commented Mar 14, 2019

This produces very good results!

I just had some minor remarks to simplify the patch.

However we do need better tests to ensure there is no regression in the future. test_parse_details_page_notes_01 does not really test anything. You should write the tests using scenarios as it will be the same test for different inputs. I could help you with that part if you are not too familiar with it.

Thanks for the comments! Definitely will implement the one-liners ASAP. I think I should also be able to write tests using the previous ones as examples—I didn't do it earlier because I thought that the parsing function would change too much, but now that it's in a relatively stable state I can do that. I'll let you know if I have any questions with it!

Copy link
Member

@rgreinho rgreinho left a comment

Looks very good! Simply use scenarios with a single test and we are good to merge!

'there were 71 fatal traffic crashes and 76 traffic fatalities. These statements are based on '
'the initial assessment of the fatal crash and investigation is still pending. Fatality information may change.'
)
if actual != expected:

This comment has been minimized.

@rgreinho

rgreinho Mar 14, 2019
Member

This is not needed. Pytest will display a diff for you if you run it in verbose mode:

pytest -x -s -vvv -m 'not integrations' -n0 tests/core/test_apd.py::test_parse_details_page_notes_02

Also after removing this the 2 tests will be identical and you could use scenarios with only 1 test.

This comment has been minimized.

@mrengler

mrengler Mar 14, 2019
Author Contributor

Oops, I totally forgot I left that in there lol. Removed!

This is Austin’s 73rd fatal traffic crash of 2018, resulting in 74 fatalities this year. At this time in 2017, there were 71 fatal traffic crashes and 76 traffic fatalities.<br />
<br /><strong><em>These statements are based on the initial assessment of the fatal crash and investigation is still pending. Fatality information may change.</em></strong></p>
"""

This comment has been minimized.

@rgreinho

rgreinho Mar 14, 2019
Member

Add a new line at the end of a file.

@rgreinho
Copy link
Member

@rgreinho rgreinho commented Mar 14, 2019

And to finish, mention your work in the CHANGELOG.

@mergify mergify bot merged commit 049e9a4 into scrapd:master Mar 15, 2019
7 checks passed
7 checks passed
@mergify
Mergify — Summary 1 rule matches
Details
ci/circleci: docs Your tests passed on CircleCI!
Details
ci/circleci: format Your tests passed on CircleCI!
Details
ci/circleci: lint Your tests passed on CircleCI!
Details
ci/circleci: prepare Your tests passed on CircleCI!
Details
ci/circleci: test Your tests passed on CircleCI!
Details
coverage/coveralls Coverage increased (+0.07%) to 97.122%
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants