
Implement data models #199

Merged
merged 1 commit into from Sep 29, 2019

Conversation

@rgreinho (Member) commented Sep 28, 2019

Types of changes

  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Code cleanup / Refactoring
  • Documentation

Description

Implements data models representing the resources captured by ScrAPD.
This allows us to validate the data we are extracting, reduces the
possibility of introducing incorrect values, and prevents polluting the
final data set.
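The idea can be sketched as follows. ScrAPD's actual models are built on pydantic; this dependency-free stand-in uses a stdlib dataclass, and the field names and checks are illustrative assumptions, not the real schema:

```python
import datetime
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-in for a ScrAPD data model; real field names may differ.
@dataclass
class Fatality:
    first_name: str = ""
    last_name: str = ""
    date: Optional[datetime.date] = None
    link: str = ""

    def __post_init__(self):
        # Validate at construction time so bad values never reach the data set.
        if self.date is not None and self.date > datetime.date.today():
            raise ValueError(f"date cannot be in the future: {self.date}")
        if self.link and not self.link.startswith("http"):
            raise ValueError(f"link must be a URL: {self.link}")
```

With pydantic the same constraints would live in validators, and invalid records would be rejected as soon as the scraper tries to build them.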

As this is a fundamental change, the PR is larger than a typical one,
and the following items were also addressed:

  • Reorganize the modules based on the resources they process.
  • Reorganize the unit tests into suites and leverage pytest.param
    objects to assign test IDs and simplify the process of adding markers.
  • Update the formatters to render the models.
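The test-suite reorganization described above might look roughly like this; the scenario data and mark names are made up, only the pytest.param pattern is the point:

```python
import pytest

# Hypothetical scenarios; the real suite's inputs and expectations differ.
deceased_scenarios = [
    pytest.param("Doe, John | White male | DOB 01/01/1980", "John Doe",
                 id="pipe-separated"),
    pytest.param("John Doe, White male, D.O.B. January 1, 1980", "John Doe",
                 id="comma-separated", marks=pytest.mark.slow),
]

@pytest.mark.parametrize("input_,expected", deceased_scenarios)
def test_parse_deceased_name(input_, expected):
    # Stand-in assertion; the real test would call the parser.
    assert expected.split()[0] in input_
```

Giving each scenario an explicit `id` makes failures readable in the test report, and attaching marks per scenario avoids duplicating the test function.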

Checklist:

  • I have updated the documentation accordingly
  • I have written unit tests

Fixes: #190

@rgreinho rgreinho requested a review from mscarey Sep 28, 2019
@rgreinho rgreinho self-assigned this Sep 28, 2019
@rgreinho rgreinho force-pushed the implement-models branch 3 times, most recently from a6f5ca3 to 616d246 Sep 28, 2019
@rgreinho rgreinho force-pushed the implement-models branch from 616d246 to 58be007 Sep 28, 2019
@mscarey (Contributor) left a comment

Good job! I like the looks of pydantic, and I'm looking forward to seeing how the new schema works with the viz page. The two comments I made above probably duplicate stuff I've said elsewhere. I think the test coverage issue is the more important one.

LINK = 'link'
LOCATION = 'location'
LONGITUDE = 'longitude'
MIDDLE_NAME = 'middle'
@mscarey (Contributor) commented Sep 29, 2019

Creating more granular fields for the names seems likely to create more problems in parsing future bulletins. I doubt that segmenting names into different parts is creating any value for any user, and we could have avoided the trouble by displaying only a full name field. But I wouldn't suggest taking the time to further change the schema in this PR. It's something to keep in mind if future bulletins fail to parse.
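A minimal sketch of the single-field alternative this comment describes; the class and property names here are hypothetical, not part of the PR:

```python
class Person:
    """Stores the name exactly as printed in the bulletin."""

    def __init__(self, full_name: str):
        self.full_name = full_name

    @property
    def last_name(self) -> str:
        # Best-effort, display-only split; parsing never depends on it.
        return self.full_name.split()[-1] if self.full_name else ""
```

Storing only the verbatim string keeps the scraper robust to bulletin format changes, while derived parts remain available for display.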

},
]

deceased_scenarios = [
@mscarey (Contributor) commented Sep 29, 2019

Maybe I missed it, but I don't see a test that starts with a page filename (e.g. 'traffic-fatality-2-3') and verifies that the parser can produce a "deceased" string in the form used as 'input_' in deceased_scenarios. I think that should be covered for most if not all of the sample pages in the data folder. There are a lot more than 2 variations on the HTML structures in those pages. We should have a lot of coverage for finding the deceased field because that's one of the functions that's most likely to need updating frequently as the APD page changes. I don't want to go back to running the CLI manually and scrolling through the results as a testing strategy, but it seems like that could be necessary the next time a webpage update breaks the scraper.
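The kind of regression coverage described here could be sketched as follows; `extract_deceased` and the inline HTML snippet are stand-ins for the real parser and the pages in the data folder:

```python
import re
from typing import Optional

def extract_deceased(html: str) -> Optional[str]:
    # Stand-in for ScrAPD's real extraction logic.
    match = re.search(r"Deceased:\s*</strong>\s*([^<]+)", html)
    return match.group(1).strip() if match else None

# In the real suite these would be files like 'traffic-fatality-2-3'
# loaded from the data folder; inline snippets keep the sketch runnable.
SAMPLE_PAGES = {
    "traffic-fatality-2-3": "<p><strong>Deceased:</strong> "
                            "John Doe, White male, DOB 01/01/1980</p>",
}

def test_every_sample_page_yields_a_deceased_string():
    for name, html in SAMPLE_PAGES.items():
        assert extract_deceased(html), f"no deceased field parsed from {name}"
```

Iterating over every saved page would catch a regression in the deceased-field parser without anyone running the CLI by hand.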

@mergify mergify bot merged commit c996508 into scrapd:master Sep 29, 2019
9 checks passed
@rgreinho rgreinho deleted the implement-models branch Sep 30, 2019
@rgreinho rgreinho mentioned this pull request Sep 30, 2019