This repository has been archived by the owner on Feb 2, 2022. It is now read-only.

Implement data models #199

Merged
merged 1 commit into from
Sep 29, 2019

Conversation

rgreinho
Member

Types of changes

  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Code cleanup / Refactoring
  • Documentation

Description

Implements data models representing the resources captured by ScrAPD.
This allows us to validate the data we extract, reduces the chance of
introducing incorrect values, and prevents polluting the final data set.

As this is a fundamental change, the PR is larger than usual, but the
following items were also addressed:

  • Reorganize the modules based on the resources they process.
  • Reorganize the unit tests into suites and leverage `pytest.param`
    objects to assign test IDs and simplify adding markers.
  • Update the formatters to render the models.
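To illustrate the approach (a minimal sketch with hypothetical field names, not the actual ScrAPD schema), a pydantic model rejects bad values at construction time, so incorrect data never reaches the final data set:

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel, ValidationError


class Fatality(BaseModel):
    """Hypothetical model for one fatality record (field names are illustrative)."""

    case: str
    crash_date: Optional[date] = None  # pydantic coerces ISO strings to `date`
    link: Optional[str] = None
    location: Optional[str] = None
    first: Optional[str] = None
    last: Optional[str] = None


# Valid input is coerced and stored with proper types.
record = Fatality(case='19-2530000', crash_date='2019-09-21')

# Invalid input raises instead of silently polluting the data set.
try:
    Fatality(case='19-2530000', crash_date='not a date')
except ValidationError:
    pass
```

Formatters can then render every record from a model with known types, instead of handling ad-hoc dictionaries.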

Checklist:

  • I have updated the documentation accordingly
  • I have written unit tests

Fixes: #190

@rgreinho rgreinho self-assigned this Sep 28, 2019
@rgreinho rgreinho force-pushed the implement-models branch 3 times, most recently from a6f5ca3 to 616d246 Compare September 28, 2019 03:40
Contributor

@mscarey mscarey left a comment


Good job! I like the looks of pydantic, and I'm looking forward to seeing how the new schema works with the viz page. The two comments I made above probably duplicate stuff I've said elsewhere. I think the test coverage issue is the more important one.

LINK = 'link'
LOCATION = 'location'
LONGITUDE = 'longitude'
MIDDLE_NAME = 'middle'
Contributor


Creating more granular fields for the names seems likely to create more problems in parsing future bulletins. I doubt that segmenting names into different parts is creating any value for any user, and we could have avoided the trouble by displaying only a full name field. But I wouldn't suggest taking the time to further change the schema in this PR. It's something to keep in mind if future bulletins fail to parse.
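If a single display name is ever preferred, one option (a sketch, not part of this PR) is to keep the granular fields internally but expose a computed full name, so consumers never depend on how the bulletin was segmented:

```python
from typing import Optional

from pydantic import BaseModel


class Name(BaseModel):
    """Illustrative name model; fields mirror the granular constants above."""

    first: Optional[str] = None
    middle: Optional[str] = None
    last: Optional[str] = None

    @property
    def full_name(self) -> str:
        # Join whichever parts were parsed, skipping the missing ones.
        return ' '.join(part for part in (self.first, self.middle, self.last) if part)
```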

},
]

deceased_scenarios = [
Contributor


Maybe I missed it, but I don't see a test that starts with a page filename (e.g. 'traffic-fatality-2-3') and verifies that the parser can produce a "deceased" string in the form used as 'input_' in deceased_scenarios. I think that should be covered for most if not all of the sample pages in the data folder. There are a lot more than 2 variations on the HTML structures in those pages. We should have a lot of coverage for finding the deceased field because that's one of the functions that's most likely to need updating frequently as the APD page changes. I don't want to go back to running the CLI manually and scrolling through the results as a testing strategy, but it seems like that could be necessary the next time a webpage update breaks the scraper.
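One way to cover this (a hypothetical sketch; the test body and fixture wiring are assumed, not from the PR) is a suite with one `pytest.param` per sample page in the data folder, so every page gets a readable test ID and can carry marks:

```python
import pytest

# One pytest.param per saved page; the explicit `id` makes failures easy
# to trace, and `marks` can tag slow or integration-style scenarios.
page_scenarios = [
    pytest.param(
        'traffic-fatality-2-3',
        id='traffic-fatality-2-3',
        marks=pytest.mark.integration,
    ),
]


@pytest.mark.parametrize('page_name', page_scenarios)
def test_deceased_from_page(page_name):
    # Hypothetical flow: load the saved HTML for `page_name`, run the
    # parser end-to-end, and verify the extracted "Deceased" string
    # matches the form used as `input_` in deceased_scenarios.
    ...
```

Extending `page_scenarios` to every file in the data folder would exercise all the HTML variations without going back to running the CLI manually.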

Development

Successfully merging this pull request may close these issues.

Update data set format to properly store multi-fatality cases
2 participants