Parse notes from detail page #60
Types of changes
I added a function to parse the notes that follow the Deceased field on the details page. If Field.NOTES has not already been populated from the Twitter metadata, it is populated from the page after the rest of the fields have been parsed.
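As a rough illustration of that fallback (not the code in this PR), here is a minimal sketch. The `Field` class below is a stand-in with an assumed value for `NOTES`, and `fill_missing_notes` is a hypothetical helper name; the record is assumed to be a plain dict.

```python
from typing import Dict


class Field:
    """Stand-in for the project's field names; the value is an assumption."""

    NOTES = "Notes"


def fill_missing_notes(record: Dict[str, str], page_notes: str) -> Dict[str, str]:
    """Set Field.NOTES from the details page only when it is still empty.

    Hypothetical helper: a value that is already present (for example one
    taken from the Twitter metadata) is left untouched.
    """
    if not record.get(Field.NOTES) and page_notes:
        record[Field.NOTES] = page_notes
    return record


# Made-up records, for illustration only.
already_set = fill_missing_notes({Field.NOTES: "from twitter"}, "from page")
newly_set = fill_missing_notes({}, "from page")
```

With this shape, the page-derived notes never replace a value that was already extracted elsewhere.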
Notes that are not present in the Twitter description metadata are currently not parsed at all. This PR lets the notes section of the details page be identified, cleaned up a bit, and returned with the other information extracted from the page about a fatality event.
I tested my changes by printing the extracted Notes section for each test page. The results are not perfect: if it is decided that, in the worst case, the extracted notes section is actually a detriment to the user, then these changes should not be accepted until additional parsing is added. However, the notes for the most recent pages are all extracted very nicely. The function tries to grab the first thing between paragraph tags after the Deceased row, and the details pages seem to be becoming more consistent in their HTML styling, in which case the logic here should also work on new details pages as they are added.
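To make the "first thing between paragraph tags after the Deceased row" idea concrete, here is a minimal sketch using only the standard library. It is not the function from this PR: `extract_notes_after_deceased` is a hypothetical name, the sample HTML is made up, and it assumes the Deceased field itself sits inside a `<p>` element.

```python
import html
import re


def extract_notes_after_deceased(page: str) -> str:
    """Return the text of the first non-empty <p> after the Deceased field, or ''."""
    marker = re.search(r"Deceased", page)
    if marker is None:
        return ""
    tail = page[marker.end():]
    # Skip the rest of the paragraph that holds the Deceased field (assumed layout).
    close = re.search(r"</p\s*>", tail, flags=re.IGNORECASE)
    if close is None:
        return ""
    tail = tail[close.end():]
    pattern = re.compile(r"<p\b[^>]*>(.*?)</p\s*>", re.IGNORECASE | re.DOTALL)
    for match in pattern.finditer(tail):
        text = re.sub(r"<[^>]+>", " ", match.group(1))  # drop inline tags
        text = " ".join(html.unescape(text).split())    # decode entities, collapse whitespace
        if text:
            return text
    return ""


# Made-up sample page fragment, for illustration only.
SAMPLE = """
<p><strong>Case:</strong> 19-0000000</p>
<p><strong>Deceased:</strong> Jane Doe | White female | 01/01/1980</p>
<p>&nbsp;</p>
<p>The preliminary investigation shows that the <em>vehicle</em> left the roadway.</p>
<p>Anyone with information should call.</p>
"""

notes = extract_notes_after_deceased(SAMPLE)
```

A regex-based approach like this is fragile on irregular markup, which matches the caveat above about older, less consistent pages.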
I changed the existing unit tests to skip the comparison of Notes values that come from the details page rather than from the Twitter metadata.
rgreinho left a comment
Wow, good job, I like the improvement!
[remy:~/projects/scrapd/scrapd] [venv] master+ ± scrapd --from "Jan 2019" --format json|grep -i "notes"|wc -l
0
[remy:~/projects/scrapd/scrapd] [venv] pr/60+ 4s ± scrapd --from "Jan 2019" --format json|grep -i "notes"|wc -l
9
A few questions and remarks here and there but you definitely found a good way to tackle the problem!
Your patch would absolutely work for now, but some note entries are hard to read because of the HTML tags embedded in the text. Once we merge your code, I'll open another ticket to sanitize the notes field.
Also, to clarify something, this statement is not true:
The results of the Twitter field parsing and the page parsing are merged in the
The Twitter field is easier to parse, therefore I have more confidence in its results and it will overwrite any field retrieved by the