Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement data models #199

Merged
merged 1 commit into from Sep 29, 2019

Conversation

@rgreinho
Copy link
Member

rgreinho commented Sep 28, 2019

Types of changes

  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Code cleanup / Refactoring
  • Documentation

Description

Implements data models representing the resources captured by ScrAPD.
This allows us to validate the data we are extracting, reduces the
possibility to introduce incorrect values and prevents polluting the
final data set.

As this is a fundamental change, the PR is way bigger than a normal
one, but the following items were also addressed:

  • Reorganize the modules based on the resources they process.
  • Reorganize the unit tests in suites and leverages pytest.param
    objects to assign test IDs and simplify the process of adding markers.
  • Update the formatters to render the models.

Checklist:

  • I have updated the documentation accordingly
  • I have written unit tests

Fixes: #190

@rgreinho rgreinho requested a review from mscarey Sep 28, 2019
@rgreinho rgreinho self-assigned this Sep 28, 2019
@rgreinho rgreinho force-pushed the rgreinho:implement-models branch 3 times, most recently from a6f5ca3 to 616d246 Sep 28, 2019
Implements data models representing the resources captured by ScrAPD.
This allows us to validate the data we are extracting, reduces the
possibility to introduce incorrect values and prevents polluting the
final data set.

As this is a fundamental change, the PR is way bigger than a normal
one, but the following items were also addressed:
* Reorganize the modules based on the resources they process.
* Reorganize the unit tests in suites and leverages `pytest.param`
  objects to assign test IDs and simplify the process of adding markers.
* Update the formatters to render the models.
@rgreinho rgreinho force-pushed the rgreinho:implement-models branch from 616d246 to 58be007 Sep 28, 2019
Copy link
Contributor

mscarey left a comment

Good job! I like the looks of pydantic, and I'm looking forward to seeing how the new schema works with the viz page. The two comments I made above probably duplicate stuff I've said elsewhere. I think the test coverage issue is the more important one.

LINK = 'link'
LOCATION = 'location'
LONGITUDE = 'longitude'
MIDDLE_NAME = 'middle'

This comment has been minimized.

Copy link
@mscarey

mscarey Sep 29, 2019

Contributor

Creating more granular fields for the names seems likely to create more problems in parsing future bulletins. I doubt that segmenting names into different parts is creating any value for any user, and we could have avoided the trouble by displaying only a full name field. But I wouldn't suggest taking the time to further change the schema in this PR. It's something to keep in mind if future bulletins fail to parse.

},
]

deceased_scenarios = [

This comment has been minimized.

Copy link
@mscarey

mscarey Sep 29, 2019

Contributor

Maybe I missed it, but I don't see a test that starts with a page filename (e.g. 'traffic-fatality-2-3') and verifies that the parser can produce a "deceased" string in the form used as 'input_' in deceased_scenarios. I think that should be covered for most if not all of the sample pages in the data folder. There are a lot more than 2 variations on the HTML structures in those pages. We should have a lot of coverage for finding the deceased field because that's one of the functions that's most likely to need updating frequently as the APD page changes. I don't want to go back to running the CLI manually and scrolling through the results as a testing strategy, but it seems like that could be necessary the next time a webpage update breaks the scraper.

@mergify mergify bot merged commit c996508 into scrapd:master Sep 29, 2019
9 checks passed
9 checks passed
Summary 1 rule matches
Details
ci/circleci: docs Your tests passed on CircleCI!
Details
ci/circleci: format Your tests passed on CircleCI!
Details
ci/circleci: lint Your tests passed on CircleCI!
Details
ci/circleci: prepare Your tests passed on CircleCI!
Details
ci/circleci: test-integrations Your tests passed on CircleCI!
Details
ci/circleci: test-units Your tests passed on CircleCI!
Details
coverage/coveralls Coverage remained the same at 100.0%
Details
security/snyk - requirements.txt (rgreinho) No new issues
Details
@rgreinho rgreinho deleted the rgreinho:implement-models branch Sep 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.