Supporting pyspark for data processing #130
base: master
Conversation
PySpark support for FIRE entities
Team, I got the code to support both Python 2 and 3. Happy to provide more context around this PR if needed.
Hi @aamend - thanks for the high-quality pull request (including tests, which is nice to see). We totally welcome this kind of interaction with the Fire schemas; however, I think this repo is not the right place for it. This repo is aimed at pure, language-agnostic JSON schemas (the Python here consists only of scripts for testing purposes rather than components of Fire itself). While it's true that Spark is immensely popular for data processing, the schemas are designed to be language-agnostic, accessible, and useful to a wide variety of institutions with varying tech stacks and capabilities. On the other hand, I think it would make a good stand-alone Python module (have you considered also adding it to PyPI?). If you wanted to make a separate repo for this project, we would definitely be up for linking to it (and similar integrations) in the main README, and I could probably offer some help, e.g. on linking to the schemas from your project.
Thanks for the review. I completely get your point and agree in principle. In practice, we need to link both projects, either loosely (as you suggested) or tightly (as proposed in this change). I was thinking this would be the easiest way, as it creates a simple bundle to pip install. If not, we'll need to ensure the FIRE schemas are available as a dependency to, e.g., PySpark code.
Realising I've kind of dropped the ball here, but keen to move forward. I can always publish my own repo but would need to discuss possible integrations first. Any chance you could reach out to antoine.amend@databricks.com?
Setting a data model is one thing; ensuring the execution of production data pipelines is another. With Spark now the de facto standard for enterprise data processing, this module programmatically interprets FIRE entities into Spark execution pipelines, resulting in the transmission and processing of high-quality data in batch or real time.
Benefits: it removes the need for a data/ops team to re-code the FIRE models into pipelines; these are inferred directly from the JSON schema files, as sketched below.
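The PR code itself isn't shown in this thread, but a minimal sketch of the inference idea might look like the following. Everything here is illustrative: the function name `fire_schema_to_struct`, the type mapping, and the `loan.json` path are assumptions for the example, not the actual module's API.

```python
import json

from pyspark.sql import SparkSession
from pyspark.sql.types import (
    BooleanType, DoubleType, IntegerType, StringType,
    StructField, StructType,
)

# Illustrative mapping from JSON Schema primitive types to Spark types.
# The real module would also need to handle enums, date/date-time formats,
# arrays, and nested objects present in the FIRE schemas.
_TYPE_MAP = {
    "string": StringType(),
    "integer": IntegerType(),
    "number": DoubleType(),
    "boolean": BooleanType(),
}


def fire_schema_to_struct(path):
    """Infer a Spark StructType from a FIRE entity's JSON schema file."""
    with open(path) as f:
        schema = json.load(f)
    required = set(schema.get("required", []))
    fields = [
        StructField(
            name,
            _TYPE_MAP.get(spec.get("type"), StringType()),
            nullable=name not in required,
        )
        for name, spec in schema.get("properties", {}).items()
    ]
    return StructType(fields)


# Usage sketch: read raw records against the inferred schema so that
# malformed rows can be rejected or quarantined downstream.
spark = SparkSession.builder.getOrCreate()
df = spark.read.schema(fire_schema_to_struct("loan.json")).json("landing/loans/")
```

The design point is that the JSON schema remains the single source of truth: data and ops teams never hand-maintain a parallel Spark schema, so pipeline validation stays in sync with the FIRE models automatically.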