
Parse notes from detail page #60

Merged
merged 19 commits into scrapd:master Mar 12, 2019

@mrengler (Contributor) commented Mar 9, 2019

Types of changes

  • Bug fix (non-breaking change which fixes an issue)

Description

I added a function to parse the notes following the Deceased field on the web page. Field.NOTES, if not populated from the twitter metadata already, is populated after the rest of the fields.

Notes that aren't present in the twitter description metadata are not presently being parsed. This PR enables the notes section of the details page to be identified, cleaned up a bit, and returned with the other information extracted from the page about a fatality event.

I tested my changes by printing the output of each test page for the Notes section. The results are not perfect: if it's decided that, in the worst case, the extracted notes section is actually a detriment to the user, then these changes should not be accepted until additional parsing is added. However, the notes for the most recent pages are all extracted very nicely. The logic grabs the first thing between paragraph tags after the Deceased row, and the details pages seem to be becoming more consistent in their HTML styling, so this logic should keep working on new details pages as they are added.

Checklist:

  • I have updated the documentation accordingly
  • I have written unit tests

I changed the current unit tests to ignore any comparison of Notes that were from the details page instead of the twitter metadata.

Fixes: #55

@rgreinho rgreinho self-requested a review Mar 9, 2019

@rgreinho rgreinho added the enhancement label Mar 9, 2019

@rgreinho (Member) left a comment

Wow, good job, I like the improvement!

From:

[remy:~/projects/scrapd/scrapd] [venv] master+ ± scrapd --from "Jan 2019" --format json|grep -i "notes"|wc -l
0

To:

[remy:~/projects/scrapd/scrapd] [venv] pr/60+ 4s ± scrapd --from "Jan 2019" --format json|grep -i "notes"|wc -l
9

A few questions and remarks here and there but you definitely found a good way to tackle the problem!

Your patch would absolutely work for now, but there are some note entries which are really unreadable, due to the HTML tags embedded in the text. Once we merge your code, I'll open another ticket to sanitize the notes field.

"""
start_tag = details_description.find('<p>') + 3
end_tag = details_description.find('</p>', start_tag)
if not details_description[start_tag:end_tag].upper().isupper():

@rgreinho (Member) commented Mar 9, 2019

.upper().isupper() always returns True as long as the string is not empty.

Therefore I think you could simply test for an empty string:

if not details_description[start_tag:end_tag]:

@mrengler (Author, Contributor) commented Mar 11, 2019

Actually, I think .upper().isupper() will return False if there are no letters in the string. I tested it on whitespace-only and digits-only strings.

@mrengler (Author, Contributor) commented Mar 11, 2019

If there's a better way of testing for strings that contain no letters, though, I'm happy to use that.

@rgreinho (Member) commented Mar 11, 2019

Ah I see. Maybe something with not .isalpha() would work? Otherwise don't sweat it too much, simply put a comment above explaining why you did that and you're good to go.
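For reference, a small sketch contrasting the checks discussed above (the sample strings and the has_letters helper are hypothetical):

```python
# .upper().isupper() is True only when the string contains at least one letter.
assert 'abc'.upper().isupper() is True
assert 'abc 123'.upper().isupper() is True
assert '123'.upper().isupper() is False      # digits only: no cased characters
assert '   '.upper().isupper() is False      # whitespace only

# str.isalpha() is stricter: it is False as soon as any non-letter appears,
# so `not snippet.isalpha()` would also trigger on 'abc 123'.
assert not 'abc 123'.isalpha()

# An explicit alternative that states the intent directly:
def has_letters(s):
    return any(c.isalpha() for c in s)

assert has_letters('abc 123')
assert not has_letters('123')
```

This is why the comment-above suggestion makes sense: the `.upper().isupper()` idiom works, but it is non-obvious without an explanation.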

first_cap = index
break
return squished[first_cap:]

@rgreinho (Member) commented Mar 9, 2019

There are a bunch of magic numbers in this function; they could be replaced with something more readable.

An idea would be to define constants and then use their length with len():

BR_TAG = r'<br \>'
start_tag = details_description.find(BR_TAG) + len(BR_TAG)
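The suggestion above can be sketched as a small helper (the function name is hypothetical); the offsets now come from the tag constants rather than hard-coded numbers:

```python
P_OPEN = '<p>'
P_CLOSE = '</p>'

def first_paragraph(details_description):
    # Skip past the opening tag using the constant's length instead of a magic 3.
    start = details_description.find(P_OPEN) + len(P_OPEN)
    end = details_description.find(P_CLOSE, start)
    return details_description[start:end]
```

For example, first_paragraph('<p>hello</p>') returns 'hello', and changing a tag constant automatically keeps the offset in sync.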

@mrengler (Author, Contributor) commented Mar 11, 2019

Good idea

# Fill in Notes from Details page if not in description.
search_description = re.compile(r'>Deceased:.*\s{2,}(.|\n)*?<\/p>(.|\n)*?<\/p>')
match_d = re.search(search_description, normalized_detail_page)
if match_d and not d.get(Fields.NOTES):

@rgreinho (Member) commented Mar 9, 2019

At this point Fields.NOTES does not exist yet. Therefore the condition can be simplified to if match_d.

Nitpicking remark, and not blocking: simply naming the variable match would read as good as match_d.
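As a quick illustration, here is the pattern from the diff matched against a simplified, hypothetical details-page fragment (real pages are messier):

```python
import re

# Pattern copied from the PR: from the Deceased row through two closing </p> tags.
search_description = re.compile(r'>Deceased:.*\s{2,}(.|\n)*?<\/p>(.|\n)*?<\/p>')

# Hypothetical fragment of a normalized details page.
page = ('>Deceased: John Doe</p>\n'
        '    <p>The crash occurred on the 1000 block.</p>\n'
        '    <p>Anyone with information should call APD.</p>')

match = re.search(search_description, page)
# A successful match spans from the Deceased row through the second </p>.
```

The lazy `(.|\n)*?` groups keep the match from running past the nearest closing paragraph tags.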

if match_d and not d.get(Fields.NOTES):
description = match_d.string[match_d.start(0):match_d.end(0)]
try:
d[Fields.NOTES] = parse_detail_page_description(description)

@rgreinho (Member) commented Mar 9, 2019

The parse_detail_page_description() function seems to only return the content of the notes. What do you think of renaming to something which indicates the goal? Ideas could be extract_notes(), or parse_notes_field().

@mrengler (Author, Contributor) commented Mar 11, 2019

Yes, for some reason I thought Notes was called Description at first, then realized it later on but never fixed this.

result, _ = event_loop.run_until_complete(apd.async_retrieve(pages=-1, **time_range))
assert result is not None
if 'Notes' in result:

@rgreinho (Member) commented Mar 9, 2019

For this test, the fields don't matter, only the number of entries. You can safely remove this.

@mrengler (Author, Contributor) commented Mar 11, 2019

It will still fail if the field isn't removed, because result will be one entry longer than expected.

@rgreinho (Member) commented Mar 11, 2019

The test might be incorrect then. It is supposed to test the number of entries retrieved within a period of time (a list of dicts), therefore the length of an entry should not impact it.
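The distinction can be shown with a toy example (the data is hypothetical): the entry count is independent of how many fields each entry carries.

```python
# Two entries retrieved for the period; one happens to carry an extra Notes field.
results = [
    {'Date': '01/16/2019', 'Notes': 'notes extracted from the details page'},
    {'Date': '01/27/2019'},
]

# A test on the number of entries passes regardless of per-entry fields.
assert len(results) == 2
```

Only an assertion on individual entry contents would be affected by the new field.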

@rgreinho (Member) commented Mar 9, 2019

Also, to clarify something, this statement is not true:

Field.NOTES, if not populated from the twitter metadata already, is populated after the rest of the fields.

The results of the twitter field parsing and the page parsing are merged in the parse_page() function (https://github.com/scrapd/scrapd/blob/master/scrapd/core/apd.py#L334).

The twitter field is easier to parse, therefore I have more confidence in its results, and it will overwrite any field retrieved by the parse_page_content() function. If you want to implement the behavior you described, you will have to implement it in the parse_page() function.
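A minimal sketch of that precedence, using plain dict-merge semantics (the field values are hypothetical; see parse_page() for the real implementation):

```python
page_fields = {'Notes': 'parsed from the details page', 'Age': 30}
twitter_fields = {'Notes': 'parsed from the twitter description'}

# Merging with the twitter fields last means they win on any conflict,
# while page-only fields are preserved.
merged = {**page_fields, **twitter_fields}
```

Here merged['Notes'] comes from the twitter metadata, while merged['Age'] survives from the page parse.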

snippet = details_description[start_tag:end_tag]
if snippet[:4] == '<img':
start_tag = details_description.find(r'<br \>') + 6
squished = details_description[start_tag:end_tag].replace('\n', ' ')

@rgreinho (Member) commented Mar 9, 2019

Lines 196 to 200 are a little confusing. It seems you are updating the start_tag index if there is no details_description within the boundaries, or if the following tag is an image. But if there is no details_description anyway, we go to the end of the function and simply return an empty string.

What do you think of rewriting it like this:

start_tag = details_description.find('<p>') + 3
end_tag = details_description.find('</p>', start_tag)
snippet = details_description[start_tag:end_tag]

# Return an empty string if there is no snippet.
if not snippet:
    return ''

# Update the snippet if the following tag is an image.
if snippet[:4] == '<img':
    start_tag = details_description.find(r'<br \>') + 6
    snippet = details_description[start_tag:end_tag]

# Remove the end of line characters.
squished = snippet.replace('\n', ' ')

# Look for the first capital letter.
[...]
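Putting the suggestion together as a self-contained sketch (the helper name is hypothetical, and the trailing capital-letter scan is elided, as in the comment above):

```python
def extract_snippet(details_description):
    start_tag = details_description.find('<p>') + 3
    end_tag = details_description.find('</p>', start_tag)
    snippet = details_description[start_tag:end_tag]

    # Return an empty string if there is no snippet.
    if not snippet:
        return ''

    # Update the snippet if the following tag is an image.
    if snippet[:4] == '<img':
        start_tag = details_description.find(r'<br \>') + 6
        snippet = details_description[start_tag:end_tag]

    # Remove the end-of-line characters.
    return snippet.replace('\n', ' ')
```

With this shape, a paragraph containing embedded newlines comes back as a single line, and the empty-snippet early return avoids the confusing index update.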
@rgreinho (Member) left a comment

Great job! 👍

@mergify mergify bot merged commit 4b3ed64 into scrapd:master Mar 12, 2019

7 checks passed

  • Mergify — Summary: 1 rule matches
  • ci/circleci (docs, format, lint, prepare, test): Your tests passed on CircleCI!
  • coverage/coveralls: Coverage increased (+0.3%) to 97.07%