Skip to content
This repository has been archived by the owner on Feb 2, 2022. It is now read-only.

Clean notes field #64

Merged
merged 8 commits into from
Mar 15, 2019
Merged

Clean notes field #64

merged 8 commits into from
Mar 15, 2019

Conversation

mrengler
Copy link
Contributor

Types of changes

  • Feature enhancement (non-breaking change which improves functionality)

Description

Action items were to:
-Find a way to consistently remove the footer information
-Clean the Notes from all HTML tags and \t
-Remove the victims First, Last, Ethnicity, Gender details (i.e. the first line in the example)

Footer information is removed based on a signal phrase which appears in many descriptions. Also, the HTML tags are removed, which made up the majority of the excess footer characters anyway. Excess whitespace is removed and segments of the victims are (mostly) removed from the Notes by looking for ___ male or _____ female. There are a few cases in which they are not removed, if there are no additional Notes following that segment (the parser has not managed to capture any notes).

This change helps the Notes fields returned to be cleaner and easier to read.

I tested the changes by printing the results of all the parsed Notes fields to a file.

Checklist:

The changes to the parse_details_page_notes methods are all documented in code and the test coverage remains the same.

  • I have updated the documentation accordingly
  • I have written unit tests

#63

Fixes: #63

Copy link
Member

@rgreinho rgreinho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This produces very good results!

I just had some minor remarks to simplify the patch.

However we do need better tests to ensure there is no regression in the future. test_parse_details_page_notes_01 does not really test anything. You should write the tests using scenarios as it will be the same test for different inputs. I could help you with that part if you are not too familiar with it.

# Demographic info is usually only included in subject description.
# DOB would be better, but that is sometimes missing.
for segment in remove_subjects:
if 'male' in segment or 'female' in segment:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The in keyword does partial match, so looking for male is enough to cover both cases.

if 'male' in segment or 'female' in segment:
remove_subjects.remove(segment)

final = ' '.join(remove_subjects)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can even replace it with a nice one liner:

final = ' '.join([segment for segment in remove_subjects if 'male' not in segment])

no_html = re.sub(re.compile('<.*?>'), '', squished[first_cap:])

# Remove tabs and, if subjects included, remove.
remove_subjects = re.split(r'\s{2,}', no_html)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems fragile to rely on 2 spaces or more. Not sure what would be a better solution without using the HTML tags though 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was having a hard time because the split on tabs didn't do much for many of the Notes, because the whitespace was a series of spaces instead 🙄

footer_string = 'Fatality information may change.'

if final.find(footer_string) != -1:
final = final[:final.find(footer_string) + len(footer_string)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just try not to run find twice in a row:

end_pos = final.find(footer_string)
if end_pos != -1:
        final = final[:end_pos + len(footer_string)]

@mrengler
Copy link
Contributor Author

This produces very good results!

I just had some minor remarks to simplify the patch.

However we do need better tests to ensure there is no regression in the future. test_parse_details_page_notes_01 does not really test anything. You should write the tests using scenarios as it will be the same test for different inputs. I could help you with that part if you are not too familiar with it.

Thanks for the comments! Definitely will implement the one-liners ASAP. I think I should also be able to write tests using the previous ones as examples—I didn't do it earlier because I thought that the parsing function would change too much, but now that it's in a relatively stable state I can do that. I'll let you know if I have any questions with it!

Copy link
Member

@rgreinho rgreinho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good! Simply use scenarios with a single test and we are good to merge!

'there were 71 fatal traffic crashes and 76 traffic fatalities. These statements are based on '
'the initial assessment of the fatal crash and investigation is still pending. Fatality information may change.'
)
if actual != expected:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not needed. Pytest will display a diff for you if you run it in verbose mode:

pytest -x -s -vvv -m 'not integrations' -n0 tests/core/test_apd.py::test_parse_details_page_notes_02

Also after removing this the 2 tests will be identical and you could use scenarios with only 1 test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I totally forgot I left that in there lol. Removed!

This is Austin’s 73rd fatal traffic crash of 2018, resulting in 74 fatalities this year. At this time in 2017, there were 71 fatal traffic crashes and 76 traffic fatalities.<br />
<br /><strong><em>These statements are based on the initial assessment of the fatal crash and investigation is still pending. Fatality information may change.</em></strong></p>
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a new line at the end of a file.

@rgreinho
Copy link
Member

And to finish, mention your work in the CHANGELOG.

@mergify mergify bot merged commit 049e9a4 into scrapd:master Mar 15, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants