Good job! I like the looks of pydantic, and I'm looking forward to seeing how the new schema works with the viz page. The two comments I made above probably duplicate stuff I've said elsewhere. I think the test coverage issue is the more important one.
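For context, a minimal sketch of the kind of pydantic model this PR introduces. The class name, field names, and validator below are illustrative assumptions, not the actual ScrAPD schema:

```python
# Hypothetical sketch; the real ScrAPD models may differ.
import datetime
from typing import Optional

from pydantic import BaseModel, validator


class Fatality(BaseModel):
    """One deceased person extracted from an APD bulletin."""

    first: Optional[str] = None
    middle: Optional[str] = None
    last: Optional[str] = None
    dob: Optional[datetime.date] = None

    @validator('dob')
    def dob_must_be_in_the_past(cls, value):
        # Reject obviously bad dates instead of letting them pollute the data set.
        if value and value > datetime.date.today():
            raise ValueError('date of birth cannot be in the future')
        return value
```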
```python
LINK = 'link'
LOCATION = 'location'
LONGITUDE = 'longitude'
MIDDLE_NAME = 'middle'
```
Creating more granular fields for the names seems likely to create more problems in parsing future bulletins. I doubt that segmenting names into different parts is creating any value for any user, and we could have avoided the trouble by displaying only a full name field. But I wouldn't suggest taking the time to further change the schema in this PR. It's something to keep in mind if future bulletins fail to parse.
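A sketch of the alternative suggested here, assuming hypothetical name-part fields on a pydantic model, with a single computed full name exposed to users instead:

```python
# Hypothetical alternative: keep whatever name parts parse, expose one field.
from typing import Optional

from pydantic import BaseModel


class Fatality(BaseModel):
    first: Optional[str] = None
    middle: Optional[str] = None
    last: Optional[str] = None

    @property
    def full_name(self) -> str:
        # Join only the parts that were actually parsed from the bulletin.
        return ' '.join(part for part in (self.first, self.middle, self.last) if part)
```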
```python
    },
]

deceased_scenarios = [
```
Maybe I missed it, but I don't see a test that starts from a page filename (e.g. 'traffic-fatality-2-3') and verifies that the parser can produce a "deceased" string in the form used as 'input_' in deceased_scenarios. I think that should be covered for most, if not all, of the sample pages in the data folder; there are a lot more than two variations on the HTML structures in those pages. We should have extensive coverage for finding the deceased field, because that is one of the functions most likely to need frequent updates as the APD page changes. I don't want to go back to running the CLI manually and scrolling through the results as a testing strategy, but that could be necessary the next time a webpage update breaks the scraper.
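A sketch of the kind of coverage being requested, assuming the sample pages live under tests/data. The module path, the parse_deceased_field entry point, and the expected string are guesses, not ScrAPD's actual API:

```python
# Illustrative only; adjust the import and helper name to the real code base.
from pathlib import Path

import pytest

from scrapd.core import apd  # assumed module layout

page_scenarios = [
    # One entry per sample page in the data folder; the expected string is a
    # placeholder for the 'input_' form used in deceased_scenarios.
    pytest.param('traffic-fatality-2-3', '<deceased string in input_ form>',
                 id='traffic-fatality-2-3'),
]


@pytest.mark.parametrize('page,expected', page_scenarios)
def test_deceased_field_from_page(page, expected):
    html = Path('tests/data', page).read_text()
    assert apd.parse_deceased_field(html) == expected  # hypothetical entry point
```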
Types of changes
Description
Implements data models representing the resources captured by ScrAPD.
This allows us to validate the data we are extracting, reduces the
possibility of introducing incorrect values, and prevents polluting the
final data set.

As this is a fundamental change, the PR is far bigger than a normal one,
but the following items were also addressed:

* Reorganize the modules based on the resources they process.
* Reorganize the unit tests into suites and leverage `pytest.param`
  objects to assign test IDs and simplify the process of adding markers
  (see the sketch below).
* Update the formatters to render the models.
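For illustration, a minimal scenario table of the shape described above. The parse_name function and the cases are made up:

```python
# Illustrative use of pytest.param to attach test IDs and markers to scenarios.
import pytest

name_scenarios = [
    pytest.param('Doe, John', {'first': 'John', 'last': 'Doe'}, id='last-comma-first'),
    pytest.param('', {}, id='empty-string',
                 marks=pytest.mark.xfail(reason='not handled yet')),
]


@pytest.mark.parametrize('input_,expected', name_scenarios)
def test_parse_name(input_, expected):
    assert parse_name(input_) == expected  # hypothetical parser function
```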
Checklist:
Fixes: #190