Transform Gaia DR2 csv into Parquet #31

Closed
Zarquan opened this issue Jan 7, 2020 · 2 comments

@Zarquan
Collaborator

Zarquan commented Jan 7, 2020

Write a Spark job to transform the Gaia DR2 csv files into Parquet.
Write it as source code that we can submit automatically as part of a Continuous Integration test suite.
Include an initial set of pass/fail tests that count the output files, check the data types, count the rows, etc.
Include a parameter to select how much of the data to process (e.g. 10 csv files or 100 csv files), so that we can run quick tests and build small deployments with partial data.
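
As a rough starting point, the job described above might look something like the following PySpark sketch. It is not the project's implementation: the function names (`select_input_files`, `convert`, `check_output`), the command-line argument order, the use of `inferSchema`, and the assumption that input and output live on a local filesystem reachable by `glob` are all hypothetical choices made for illustration.

```python
import glob
import os
import sys


def select_input_files(paths, limit=None):
    """Return a deterministic subset of the csv paths: sorted, then the first `limit`."""
    ordered = sorted(paths)
    return ordered if limit is None else ordered[:limit]


def convert(spark, csv_files, output_dir):
    """Read the selected csv files, write them as Parquet, and return the row count."""
    # inferSchema is a placeholder; an explicit schema would give stricter typing.
    frame = spark.read.csv(csv_files, header=True, inferSchema=True)
    frame.write.mode("overwrite").parquet(output_dir)
    return frame.count()


def check_output(spark, output_dir, expected_rows):
    """Pass/fail checks: Parquet part files exist and the row count matches."""
    parts = glob.glob(os.path.join(output_dir, "*.parquet"))
    assert parts, "no Parquet files written"
    assert spark.read.parquet(output_dir).count() == expected_rows


# Runs only when all three arguments are supplied: input dir, output dir, file limit.
if __name__ == "__main__" and len(sys.argv) == 4:
    from pyspark.sql import SparkSession  # imported here so the helpers load without Spark

    input_dir, output_dir, limit = sys.argv[1], sys.argv[2], int(sys.argv[3])
    spark = SparkSession.builder.appName("gaia-dr2-csv-to-parquet").getOrCreate()
    files = select_input_files(glob.glob(os.path.join(input_dir, "*.csv*")), limit)
    rows = convert(spark, files, output_dir)
    check_output(spark, output_dir, rows)
    spark.stop()
```

Sorting before slicing keeps the subset stable between runs, so a CI run with a limit of 10 always processes the same 10 files. A data type check would be added to `check_output` once an explicit schema is available.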

@NigelHambly
Collaborator

You can get the gaia_source schema from the GACS TAP interface. Let me know if you need any help with that, or with getting it from an alternative source: for example, the Dictionary Tool can very easily generate a raw SQL schema and export it as a plain text file, a Java class, or an XML schema.

In fact, looking ahead, it might be good to create a generic schema-driven approach that can take the Gaia published catalogue data model (as an XML schema for example) and drive the csv-to-parquet conversion including the strongly-typed specification.
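
One way to picture the schema-driven idea is a small translation step that turns column metadata into a typed schema for the Spark reader. The sketch below assumes the metadata arrives as (name, datatype) pairs using VOTable-style datatype names; the mapping table, the `spark_ddl` helper, and the datatypes given for the example columns are assumptions for illustration, not taken from the Gaia data model itself.

```python
# Assumed mapping from VOTable-style datatype names to Spark SQL type names.
VOTABLE_TO_SPARK = {
    "boolean": "BOOLEAN",
    "short": "SMALLINT",
    "int": "INT",
    "long": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "char": "STRING",
}


def spark_ddl(columns):
    """Build a Spark DDL schema string from (name, datatype) pairs."""
    fields = []
    for name, datatype in columns:
        spark_type = VOTABLE_TO_SPARK[datatype.lower()]  # KeyError flags an unmapped type
        # Backticks keep names such as `dec` from being read as SQL keywords.
        fields.append(f"`{name}` {spark_type}")
    return ", ".join(fields)


# Illustrative subset of gaia_source columns; the datatypes here are assumptions.
example_columns = [
    ("source_id", "long"),
    ("ra", "double"),
    ("dec", "double"),
    ("phot_g_mean_mag", "float"),
    ("duplicated_source", "boolean"),
]

schema = spark_ddl(example_columns)
# The string can then be passed to the reader: spark.read.schema(schema).csv(...)
```

Because `DataFrameReader.schema` accepts a DDL-formatted string, the same helper could be fed from whichever source ends up supplying the metadata (TAP, a SQL export, or an XML schema) without changing the conversion job.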

@Zarquan
Collaborator Author

Zarquan commented Jan 23, 2020

Closing this issue. Created a new issue #42 to cover next steps.

@Zarquan Zarquan closed this as completed Jan 23, 2020