Clean notes field #64
Types of changes
Action items were to:
Footer information is removed based on a signal phrase that appears in many descriptions. The HTML tags are also removed; they made up the majority of the excess footer characters anyway. Excess whitespace is removed, and the victim segments are (mostly) removed from the Notes by looking for ___ male or _____ female. In a few cases they are not removed: when no additional Notes follow that segment (that is, the parser has not managed to capture any notes).
This change makes the returned Notes fields cleaner and easier to read.
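For readers who want a concrete picture of the cleaning steps described above, here is a minimal sketch. It is not the code from this patch: the function name `clean_notes`, the signal phrase in `FOOTER_SIGNAL`, and the sample input are all assumptions made for illustration, and the victim-segment rule is simplified to "drop everything up to the first standalone male/female".

```python
import re

# Hypothetical signal phrase; the phrase actually used by the patch is not shown here.
FOOTER_SIGNAL = "Anyone with information"

# Matches a standalone "male" or "female" (the end of a "___ male" style victim segment).
VICTIM_RE = re.compile(r"\b(?:male|female)\b", re.IGNORECASE)


def clean_notes(raw, footer_signal=FOOTER_SIGNAL):
    """Illustrative cleanup of a raw Notes string (a sketch, not the real parser)."""
    # 1. Drop the footer: keep only the text before the signal phrase.
    text = raw.split(footer_signal, 1)[0]
    # 2. Replace HTML tags with a space so adjacent words do not merge.
    text = re.sub(r"<[^>]+>", " ", text)
    # 3. Collapse runs of whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    # 4. Drop the victim segment, but only if some notes remain after it.
    match = VICTIM_RE.search(text)
    if match:
        tail = text[match.end():].lstrip(" |,.:;-")
        if tail:
            text = tail
    return text


sample = (
    "<p>Deceased: Jane Doe, Hispanic female</p>"
    "<p>The   vehicle left the roadway.</p>"
    "<p>Anyone with information should call the unit.</p>"
)
print(clean_notes(sample))
```

As in the description, when nothing follows the victim segment the sketch leaves the text untouched rather than returning an empty string. A narrative that itself mentions "male" or "female" later on is not affected, since only the first match is used, but a second victim segment would survive.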
I tested the changes by printing the results of all the parsed Notes fields to a file.
The changes to the parse_details_page_notes methods are all documented in code and the test coverage remains the same.
rgreinho left a comment
This produces very good results!
I just had some minor remarks to simplify the patch.
However, we do need better tests to ensure there is no regression in the future.
Thanks for the comments! Definitely will implement the one-liners ASAP. I think I should also be able to write tests using the previous ones as examples—I didn't do it earlier because I thought that the parsing function would change too much, but now that it's in a relatively stable state I can do that. I'll let you know if I have any questions with it!