Supporting pyspark for data processing #130
base: master
Conversation
PySpark support for FIRE entities
Team, I got the code to support both Python 2 and 3. Happy to provide more context around this PR if needed.
Hi @aamend - thanks for the high-quality pull request (including tests, which is nice to see). We totally welcome this kind of interaction with the Fire schemas; however, I think this repo is not the right place for it. This repo is aimed at pure, language-agnostic JSON schemas (the Python here consists only of scripts for testing purposes rather than components of Fire itself). While it's true that Spark is immensely popular for data processing, the schemas are designed to be language-agnostic, accessible, and useful to a wide variety of institutions with varying tech stacks and capabilities. On the other hand, I think it would make a good stand-alone Python module (have you considered also adding it to PyPI?). If you wanted to make a separate repo for this project, we would definitely be up for linking to it (and similar integrations) in the main README, and I could probably offer some help, e.g. on linking to the schemas from your project.
Thanks for the review. I completely get your point and agree in principle. In practice, we need to link both projects, either loosely (as you suggested) or tightly (as proposed in this change). I was thinking this would be the easiest way, as it creates a simple bundle to pip install. If not, we'll need to ensure the FIRE schemas are available as a dependency to, e.g., PySpark code.
Realising I've kind of dropped the ball here, but keen to move forward. I can always publish my own repo but would need to discuss possible integrations first. Any chance you could reach out to antoine.amend@databricks.com?
Setting a data model is one thing; ensuring the execution of production data pipelines is another. With Spark now the de facto standard for enterprise data processing, this module programmatically interprets FIRE entities into Spark execution pipelines, resulting in the transmission and processing of high-quality data in batch or real time.
Benefits: it removes the need for a data/ops team to re-code the FIRE models into pipelines; these are inferred directly from the JSON schema files, as sketched below.
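The PR code itself isn't shown in this thread, but a minimal sketch of the inference idea might look like the following. Everything here is illustrative: the function name `fire_schema_to_struct`, the type mapping, and the `loan.json` path are assumptions for the example, not the actual module's API.

```python
import json

from pyspark.sql import SparkSession
from pyspark.sql.types import (
    BooleanType, DoubleType, IntegerType, StringType,
    StructField, StructType,
)

# Illustrative mapping from JSON Schema primitive types to Spark types.
# The real module would also need to handle enums, date/date-time formats,
# arrays, and nested objects present in the FIRE schemas.
_TYPE_MAP = {
    "string": StringType(),
    "integer": IntegerType(),
    "number": DoubleType(),
    "boolean": BooleanType(),
}


def fire_schema_to_struct(path):
    """Infer a Spark StructType from a FIRE entity's JSON schema file."""
    with open(path) as f:
        schema = json.load(f)
    required = set(schema.get("required", []))
    fields = [
        StructField(
            name,
            _TYPE_MAP.get(spec.get("type"), StringType()),
            nullable=name not in required,
        )
        for name, spec in schema.get("properties", {}).items()
    ]
    return StructType(fields)


# Usage sketch: read raw records against the inferred schema so that
# malformed rows can be rejected or quarantined downstream.
spark = SparkSession.builder.getOrCreate()
df = spark.read.schema(fire_schema_to_struct("loan.json")).json("landing/loans/")
```

The design point is that the JSON schema remains the single source of truth: data and ops teams never hand-maintain a parallel Spark schema, so pipeline validation stays in sync with the FIRE models automatically.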