A Spark job to transform the Gaia DR2 csv files into Parquet.
Written as source code for a Spark job that we can submit automatically as part of a Continuous Integration test suite.
Include an initial set of pass/fail tests to count the output files, check the data types, count the rows, etc.
Include a parameter to select how much of the data to process (e.g. 10 csv files, 100 csv files), enabling us to run quick tests and build small deployments with partial data.
You can get the gaia_source schema from the GACS TAP interface, but let me know if you need any help with that or with getting it from an alternative source (e.g. the Dictionary Tool can generate a raw SQL schema very easily and export it as a plain text file, a Java class, or an XML schema).
In fact, looking ahead, it might be good to create a generic schema-driven approach that takes the Gaia published catalogue data model (as an XML schema, for example) and drives the csv-to-parquet conversion, including the strongly-typed schema specification.
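One possible shape for that schema-driven step, as a sketch: parse a simple XML description of the gaia_source columns into a Spark DDL schema string that the csv reader can consume instead of inferring types. The XML element names and the type mapping here are hypothetical placeholders, not the actual Gaia data model format:

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from data-model type names to Spark SQL types;
# the real Gaia data model may use different type names.
TYPE_MAP = {
    "long": "BIGINT",
    "int": "INT",
    "double": "DOUBLE",
    "float": "FLOAT",
    "string": "STRING",
    "boolean": "BOOLEAN",
}


def ddl_schema_from_xml(xml_text):
    """Turn <column name="..." type="..."/> elements into a Spark DDL
    schema string, e.g. "source_id BIGINT, ra DOUBLE"."""
    root = ET.fromstring(xml_text)
    parts = []
    for col in root.iter("column"):
        sql_type = TYPE_MAP[col.get("type").lower()]
        parts.append(f"{col.get('name')} {sql_type}")
    return ", ".join(parts)


example = """
<table name="gaia_source">
  <column name="source_id" type="long"/>
  <column name="ra" type="double"/>
  <column name="dec" type="double"/>
</table>
"""

# The resulting string can be passed straight to the reader:
#   spark.read.schema(ddl_schema_from_xml(example)).csv(...)
```

Driving the conversion from the published data model this way would also give the CI tests a single source of truth for the data-type checks.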