Transform Gaia DR2 csv into Parquet #31

Zarquan · 2020-01-07T11:27:04Z

A Spark job to transform the Gaia DR2 csv files into Parquet.
Write it as source code for a Spark job that we can submit automatically as part of a Continuous Integration test suite.
Include an initial set of pass/fail tests that count the output files, check the data types, count the rows, etc.
Include a parameter to select how much of the data to process (e.g. 10 csv files or 100 csv files), so that we can run quick tests and build small deployments with partial data.
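
As a rough illustration of the shape such a job could take, here is a minimal PySpark-style sketch. It is not the code that was eventually written for this issue: the function names (`select_inputs`, `convert`, `type_mismatches`, `check_output`), the `*.csv*` file pattern and the local-filesystem file listing are all assumptions made for the example. The `spark` argument is expected to be an existing `SparkSession`.

```python
from pathlib import Path


def select_inputs(input_dir, limit=None):
    """Return the csv files in input_dir sorted by name, truncated to
    `limit` files when a limit is given (hypothetical helper)."""
    files = sorted(str(p) for p in Path(input_dir).glob("*.csv*"))
    return files if limit is None else files[:limit]


def convert(spark, inputs, output_dir, schema=None):
    """Read the selected csv files and write them out as Parquet.
    `schema` may be a StructType or a DDL string; if omitted, every
    column is read as a string."""
    reader = spark.read.option("header", "true")
    if schema is not None:
        reader = reader.schema(schema)
    reader.csv(inputs).write.mode("overwrite").parquet(output_dir)


def type_mismatches(actual_dtypes, expected):
    """Compare DataFrame.dtypes-style (name, type) pairs against the
    expected {name: type} mapping; return (name, expected, actual)."""
    actual = dict(actual_dtypes)
    return [
        (name, want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    ]


def check_output(spark, inputs, output_dir, expected_types):
    """Pass/fail checks: row count preserved and column types as expected."""
    csv_rows = spark.read.option("header", "true").csv(inputs).count()
    parquet = spark.read.parquet(output_dir)
    assert parquet.count() == csv_rows, "row count changed"
    assert not type_mismatches(parquet.dtypes, expected_types), "type mismatch"
```

A quick test run would then call `select_inputs(input_dir, limit=10)` and pass the result to `convert` and `check_output`; leaving `limit` as `None` processes everything. Counting the output files is omitted here because how to list them depends on the target filesystem (local disk versus HDFS).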

NigelHambly · 2020-01-09T11:12:45Z

You can get the gaia_source schema from the GACS TAP interface, but let me know if you need any help with that or with getting it from an alternative source (e.g. the Dictionary Tool can very easily generate a raw SQL schema and export it to a plain text file, a Java class, or an XML schema).

In fact, looking ahead, it might be good to create a generic schema-driven approach that takes the Gaia published catalogue data model (as an XML schema, for example) and uses it to drive the csv-to-Parquet conversion, including the strongly typed column specification.
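
To make the schema-driven idea concrete, here is a small sketch that turns an XML description of a table into a DDL schema string of the kind Spark's csv reader accepts. The XML layout (`<table>` / `<column name=… type=…>`) and the type names in `SPARK_TYPES` are invented for this example; the real export from the Dictionary Tool will look different and would need its own parser. The four columns are just an illustrative subset of gaia_source.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from data-model type names to Spark SQL types.
SPARK_TYPES = {
    "long": "BIGINT",
    "int": "INT",
    "short": "SMALLINT",
    "byte": "TINYINT",
    "double": "DOUBLE",
    "float": "FLOAT",
    "boolean": "BOOLEAN",
    "string": "STRING",
}

# Assumed XML layout, for illustration only.
SAMPLE_MODEL = """
<table name="gaia_source">
  <column name="source_id" type="long"/>
  <column name="ra" type="double"/>
  <column name="dec" type="double"/>
  <column name="phot_g_mean_mag" type="float"/>
</table>
"""


def ddl_from_model(xml_text):
    """Build a DDL schema string from the (assumed) XML data model.
    Column names are backtick-quoted so that names which are also
    SQL keywords, such as `dec`, cannot confuse the parser."""
    root = ET.fromstring(xml_text)
    columns = []
    for col in root.iter("column"):
        name, kind = col.get("name"), col.get("type")
        if kind not in SPARK_TYPES:
            raise ValueError(f"no Spark type for {name}: {kind}")
        columns.append(f"`{name}` {SPARK_TYPES[kind]}")
    return ", ".join(columns)


if __name__ == "__main__":
    print(ddl_from_model(SAMPLE_MODEL))
```

The resulting string could be passed as the `schema` argument when reading the csv files, so the Parquet output carries the types from the data model rather than inferred or all-string columns.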

Zarquan · 2020-01-23T14:40:49Z

Closing this issue. Created a new issue #42 to cover next steps.

Zarquan assigned stvoutsin Jan 7, 2020

Zarquan mentioned this issue Jan 7, 2020

Deploy a Spark/Yarn/Hadoop Cluster with Gaia DR2 on Openstack #26

Closed

stvoutsin mentioned this issue Jan 13, 2020

Stv issue 30 #38

Merged

Zarquan mentioned this issue Jan 23, 2020

Generate import schema from Gaia data model #42

Closed

Zarquan closed this as completed Jan 23, 2020
