Parse notes from detail page #60
Conversation
Wow, good job, I like the improvement!
From:
[remy:~/projects/scrapd/scrapd] [venv] master+ ± scrapd --from "Jan 2019" --format json|grep -i "notes"|wc -l
0
To:
[remy:~/projects/scrapd/scrapd] [venv] pr/60+ 4s ± scrapd --from "Jan 2019" --format json|grep -i "notes"|wc -l
9
A few questions and remarks here and there but you definitely found a good way to tackle the problem!
Your patch would absolutely work for now, but there are some note entries which are really unreadable, due to the HTML tags embedded in the text. Once we merge your code, I'll open another ticket to sanitize the notes field.
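A sanitizing pass like the one mentioned could, for instance, strip the embedded tags with a regular expression. This is only a hypothetical sketch of what that follow-up ticket might implement; the function name and approach are assumptions, not the project's actual code:

```python
import re

def sanitize_notes(notes):
    """Remove embedded HTML tags from a notes string (hypothetical sketch)."""
    # Drop anything that looks like an HTML tag.
    without_tags = re.sub(r'<[^>]+>', ' ', notes)
    # Collapse the runs of whitespace left behind by the removed tags.
    return re.sub(r'\s+', ' ', without_tags).strip()
```

For example, `sanitize_notes(r'The driver<br \>fled the scene.')` would yield `'The driver fled the scene.'`.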
scrapd/core/apd.py
Outdated
""" | ||
start_tag = details_description.find('<p>') + 3 | ||
end_tag = details_description.find('</p>', start_tag) | ||
if not details_description[start_tag:end_tag].upper().isupper(): |
.upper().isupper() always returns True as long as the string is not empty. Therefore I think you could simply test for an empty string:
if not details_description[start_tag:end_tag]:
Actually, I think .upper().isupper() will return False if there are no letters in the string. I tested with whitespace-only strings and with strings of numbers only.
If there's a better way of testing for strings that contain no letters, though, I'm happy to use that.
Ah I see. Maybe something with not .isalpha() would work? Otherwise don't sweat it too much; simply put a comment above explaining why you did that and you're good to go.
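For reference, a quick comparison of the checks discussed in this thread (the sample strings are illustrative only):

```python
# Behavior of the candidate "does this string contain letters?" checks.
samples = ["", "   ", "1234", " 42 ", "abc", "a1"]

# .upper().isupper() is False when the string has no cased letters,
# which includes the empty string, whitespace, and digits-only strings.
assert not "1234".upper().isupper()
assert not "".upper().isupper()

# It agrees with the explicit any(c.isalpha() ...) check on these samples.
assert all(s.upper().isupper() == any(c.isalpha() for c in s) for s in samples)

# .isalpha() is stricter: it requires EVERY character to be a letter,
# so it rejects "a1" even though "a1" does contain a letter.
assert not "a1".isalpha() and "a1".upper().isupper()
```

So `not snippet.upper().isupper()` and `not any(c.isalpha() for c in snippet)` behave alike here, while `not snippet.isalpha()` would also flag mixed strings like `"a1"`.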
first_cap = index
break
return squished[first_cap:]
There are a bunch of magic numbers in this function; they could be replaced with something more readable. An idea would be to define constants and then use their length with len():
BR_TAG = r'<br \>'
start_tag = details_description.find(BR_TAG) + len(BR_TAG)
Good idea
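The constant-based suggestion above can be sketched as a small runnable example. The sample HTML string and variable names here are hypothetical stand-ins, not the project's actual data:

```python
# Use named constants and len() instead of magic offsets like +3 and +6.
P_TAG = '<p>'
BR_TAG = r'<br \>'

# Hypothetical detail-page fragment: an image followed by the notes text.
details_description = r'<p><img src="x.jpg"><br \>The driver was traveling north.</p>'

start_tag = details_description.find(P_TAG) + len(P_TAG)
end_tag = details_description.find('</p>', start_tag)
snippet = details_description[start_tag:end_tag]

if snippet.startswith('<img'):
    # Skip past the line-break tag that follows the image.
    start_tag = details_description.find(BR_TAG) + len(BR_TAG)
    snippet = details_description[start_tag:end_tag]
```

With this input, `snippet` ends up as `'The driver was traveling north.'`, and changing the tag spelling only requires editing the constant.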
scrapd/core/apd.py
Outdated
# Fill in Notes from Details page if not in description.
search_description = re.compile(r'>Deceased:.*\s{2,}(.|\n)*?<\/p>(.|\n)*?<\/p>')
match_d = re.search(search_description, normalized_detail_page)
if match_d and not d.get(Fields.NOTES):
At this point Fields.NOTES does not exist yet, therefore the condition can be simplified to if match_d.
Nitpicking remark, and not blocking: simply naming the variable match would read as well as match_d.
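For context, the regular expression from this hunk can be exercised against a minimal page fragment. The HTML below is a hypothetical stand-in for a real detail page, not an actual APD page:

```python
import re

# Hypothetical fragment modeled on a detail page: a Deceased row followed
# by two paragraphs, the second of which holds the notes.
normalized_detail_page = (
    '<strong>Deceased:</strong>   John Doe\n'
    '    <p>The driver was traveling north.</p>\n'
    '    <p>Anyone with information should call APD.</p>'
)

search_description = re.compile(r'>Deceased:.*\s{2,}(.|\n)*?<\/p>(.|\n)*?<\/p>')
match = re.search(search_description, normalized_detail_page)

# The match spans from the Deceased label to the end of the second paragraph.
description = match.string[match.start(0):match.end(0)]
```

Here `description` starts at the `>` before `Deceased:` and ends at the second `</p>`, which is the span the later parsing step works on.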
scrapd/core/apd.py
Outdated
if match_d and not d.get(Fields.NOTES):
description = match_d.string[match_d.start(0):match_d.end(0)]
try:
d[Fields.NOTES] = parse_detail_page_description(description)
The parse_detail_page_description() function seems to only return the content of the notes. What do you think of renaming it to something which indicates the goal? Ideas could be extract_notes() or parse_notes_field().
Yes, for some reason I thought Notes was called Description at first, then realized it later on but never fixed this.
tests/step_defs/test_retrieve.py
Outdated
result, _ = event_loop.run_until_complete(apd.async_retrieve(pages=-1, **time_range))
assert result is not None
if 'Notes' in result:
For this test, the fields don't matter, only the number of entries. You can safely remove this.
It will still fail if the field isn't removed, because result will be one entry longer than expected.
The test might be incorrect then. It is supposed to test the number of entries retrieved within a period of time (a list of dict), therefore the length of an entry should not impact it.
Also, to clarify something, this statement is not true. The results of the twitter field parsing and the page parsing are merged: the twitter field is easier to parse, therefore I have more confidence in its results, and it will overwrite any field retrieved by the page parsing.
scrapd/core/apd.py
Outdated
snippet = details_description[start_tag:end_tag]
if snippet[:4] == '<img':
start_tag = details_description.find(r'<br \>') + 6
squished = details_description[start_tag:end_tag].replace('\n', ' ')
Lines 196 to 200 are a little confusing. It seems you are updating the start_tag index if there is no details_description within the boundaries, or if the following tag is an image. But if there is no details_description anyway, we are going to the end of the function simply to return an empty string.
What do you think of rewriting it like this:
start_tag = details_description.find('<p>') + 3
end_tag = details_description.find('</p>', start_tag)
snippet = details_description[start_tag:end_tag]
# Return an empty string if there is no snippet.
if not snippet:
return ''
# Update the snippet if the following tag is an image.
if snippet[:4] == '<img':
start_tag = details_description.find(r'<br \>') + 6
snippet = details_description[start_tag:end_tag]
# Remove the end of line characters.
squished = snippet.replace('\n', ' ')
# Look for the first capital letter.
[...]
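Filling in the elided tail (the capital-letter scan quoted earlier in this review), the whole helper might read as follows. This is only a sketch assembled from the snippets in this thread; the function name is hypothetical and the real code in scrapd/core/apd.py may differ:

```python
def parse_notes_snippet(details_description):
    """Extract the notes snippet from a detail page description (sketch)."""
    start_tag = details_description.find('<p>') + len('<p>')
    end_tag = details_description.find('</p>', start_tag)
    snippet = details_description[start_tag:end_tag]
    # Return an empty string if there is no snippet.
    if not snippet:
        return ''
    # Update the snippet if the following tag is an image.
    if snippet[:4] == '<img':
        BR_TAG = r'<br \>'
        start_tag = details_description.find(BR_TAG) + len(BR_TAG)
        snippet = details_description[start_tag:end_tag]
    # Remove the end of line characters.
    squished = snippet.replace('\n', ' ')
    # Look for the first capital letter and trim everything before it.
    for index, char in enumerate(squished):
        if char.isupper():
            return squished[index:]
    return ''
```

For instance, `parse_notes_snippet(r'<p><img src="x"><br \>   Police responded.</p>')` would return `'Police responded.'`.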
Great job! 👍
Types of changes
Description
I added a function to parse the notes following the Deceased field on the web page. Fields.NOTES, if not already populated from the twitter metadata, is populated after the rest of the fields.
Notes that aren't present in the twitter description metadata are not presently being parsed. This PR enables the notes section of the details page to be identified, cleaned up a bit, and returned with the other information extracted from the page about a fatality event.
I tested my changes by printing the output of each test page for the Notes section. They are not perfect: if it is decided that, in the worst case, the extracted notes section is actually a detriment to the user, then these changes should not be accepted until additional parsing is added. However, the notes for the most recent pages are all extracted very nicely. The logic tries to grab the first thing between paragraph tags after the Deceased row, and the details pages seem to be becoming more consistent in their HTML styling, in which case the logic here would work on new details pages as they are added.
Checklist:
I changed the current unit tests to ignore any comparison of Notes that were from the details page instead of the twitter metadata.
Fixes: #55