
Server: keep Postgres static schema #371

Closed
szareiangm opened this issue Jun 26, 2018 · 11 comments

@szareiangm
Contributor

Hello team,

We had this conversation in Gitter, but I want to make it official here. I was wondering if you folks agree to keep Postgres static schema DDL in the Iglu server?

I saw #193 and #194, but they seem to be a bit far off for now.

@szareiangm szareiangm changed the title Iglu server to keep Postgres static schema Server: keep Postgres static schema Jun 26, 2018
@alexanderdean
Member

Can you clarify what Postgres static schema DDL is?

@szareiangm
Contributor Author

Like the .sql files in Iglu Central (e.g. https://github.com/snowplow/iglu-central/tree/master/sql/com.amazon.aws.cloudfront), but in Iglu Server.

@alexanderdean
Member

alexanderdean commented Jun 26, 2018

We don't have a plan to store SQL static files inside Iglu server - the maintenance burden would be enormous. The linked tickets explain our plan around being able to generate SQL "files" on Iglu server endpoints based on schemas in the server.

@szareiangm
Contributor Author

Yes, but the links in the tickets are not working. Could I ask for working links so that I can review the plan details, please?

@chuwy
Contributor

chuwy commented Jun 27, 2018

Sorry @szareiangm, I'm afraid there are no working links for these tickets. And unfortunately we don't have any ETA on full Postgres support yet (though many open source users are interested in it). But PRs for schema-ddl can significantly speed up this process.

@szareiangm
Contributor Author

szareiangm commented Jun 27, 2018

Thanks @chuwy. After spending some time on this, I understood that Iglu Server cannot generate Redshift DDL, but Igluctl can.

Postgres and Redshift have similarities, and the schema-ddl code could be reused for Postgres, too. How about this plan below?

  1. In schema-ddl: move the shared code out to a separate package, called something like sql. This package might have some traits/abstractions to keep the structure the same for the higher-level code using it (Schema DDL: separate redshift-specific code from standard SQL #372); see the sketch after this list.
  2. In schema-ddl: add Postgres data types and syntax in the new package (Schema DDL: Add ability to generate Postgres DDL #194).
  3. In iglu-server: add the capability to call DdlGenerator for the schema.
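
To make step 1 concrete, here is a minimal sketch of the kind of abstraction the shared sql package could provide. Every name below is hypothetical; none of this exists in schema-ddl today, it only illustrates the idea of a common Column shape over dialect-specific sealed type hierarchies:

// Hypothetical sketch only; names are invented for illustration.
// The shared `sql` package would hold the dialect marker and the
// common shapes, while each dialect keeps its own sealed family
// of data types in its own package.
sealed trait Dialect
trait Redshift extends Dialect
trait Postgres extends Dialect

// Common interface; implementations live in dialect-specific packages
trait DataType[D <: Dialect] { def toDdl: String }

// Redshift-specific types (would stay in the redshift package)
sealed trait RedshiftType extends DataType[Redshift]
case object RedshiftTimestamp extends RedshiftType { def toDdl = "TIMESTAMP" }
final case class RedshiftVarchar(size: Int) extends RedshiftType {
  def toDdl = s"VARCHAR($size)"
}

// A column definition shared across dialects but pinned to exactly one
final case class Column[D <: Dialect](name: String, dataType: DataType[D]) {
  def toDdl: String = s"$name ${dataType.toDdl}"
}

A Postgres package would then add its own sealed hierarchy extending DataType[Postgres], reusing Column and any other shared machinery.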

I will do this in steps, with separate pull requests and separate issues.

cc @alexanderdean @Nafid

@chuwy
Contributor

chuwy commented Jun 29, 2018

Hey @szareiangm,

Thanks for your interest in this, and sorry for the slightly delayed response. I have a few points here:

I like the plan overall, but I'm a little hazy on the first step. Although abstracting over SQL DDLs sounds like a good idea in theory, I'm not 100% sure it is a good idea in practice. The problem is that SQL DDLs usually differ in the nuance that is most important to us: data type definitions. As an example, you can take a look at the Snowflake AST, which is quite different. And we definitely want to have data types as a sealed hierarchy. The same goes for column definitions: column options can be very different.

I'm not saying that I disagree with this step, but if we take this path we need to be sure that we don't lose any precision in the DDLs. E.g. we cannot have definitions like:

case class Column(name: String, sqlType: SqlType, sqlOptions: SqlOptions)

This is a very lossy type: we won't be able to inspect it, as pretty much anything will be able to extend SqlType and SqlOptions. In a slightly similar fashion, though, I can imagine the following definition:

case class Column[T <: SqlDdl](name: String, sql: SqlType[T], sqlOptions: SqlOptions[T])

In summary, this is a very hard trade-off between boilerplate (until now, I was leaning towards boilerplate) and possibly leaky abstractions.
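
For illustration, a fleshed-out version of that definition might look as follows. All names here are made up for the example, not actual schema-ddl code; the point is that the type parameter rejects mixed-dialect columns at compile time:

object DialectSafety {
  sealed trait SqlDdl
  trait RedshiftDdl extends SqlDdl
  trait PostgresDdl extends SqlDdl

  // Dialect-tagged types and options; invariance in T is what
  // forces both arguments of Column to agree on one dialect
  trait SqlType[T <: SqlDdl]
  trait SqlOptions[T <: SqlDdl]

  case object RsVarcharMax extends SqlType[RedshiftDdl]
  case object RsOptions extends SqlOptions[RedshiftDdl]
  case object PgOptions extends SqlOptions[PostgresDdl]

  final case class Column[T <: SqlDdl](
    name: String,
    sqlType: SqlType[T],
    sqlOptions: SqlOptions[T]
  )

  // Compiles: both arguments agree that T = RedshiftDdl
  val ok = Column("foo", RsVarcharMax, RsOptions)

  // Does not compile: no single T is both RedshiftDdl and PostgresDdl
  // val bad = Column("foo", RsVarcharMax, PgOptions)
}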

One more important point: even if we implement your roadmap, we'll still be missing one very important building block in Postgres support. Namely, we don't have Postgres support in RDB Shredder (snowplow/snowplow-rdb-loader#47).

@szareiangm
Contributor Author

@chuwy Good morning from Toronto!
What is your suggestion? Sealing the data types and column options inside the redshift package and having duplicate, similar ones for postgres?

About Postgres support in RDB Shredder, I need to spend some time to understand the dynamics of it. I am not sure that it blocks this plan.

@chuwy
Contributor

chuwy commented Jun 29, 2018

Hey @szareiangm!

> About Postgres support in RDB Shredder, I need to spend some time to understand the dynamics of it. I am not sure that it blocks this plan.

Sure, please feel free to ask questions. In short: enriched events are our canonical TSV, containing all common fields (e.g. event_id, collector_tstamp, etc.) as well as self-describing JSONs (contexts, unstructured events). And Shredder is a Spark transformer job that transforms this heterogeneous TSV+JSON into something more friendly for a particular DB (in our case Redshift), like plain TSV.
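
As a toy illustration of that idea (the real RDB Shredder is a Spark job and the enriched event format is far richer, so everything below is simplified and the names are invented):

object ShredSketch {
  // A drastically simplified enriched event: a couple of common
  // fields plus attached self-describing JSONs (contexts)
  final case class Enriched(
    eventId: String,
    collectorTstamp: String,
    contexts: List[(String, String)] // (schema key, JSON payload)
  )

  // "Shredding": fan one heterogeneous event out into homogeneous
  // per-schema rows that a relational DB can load as plain TSV
  def shred(e: Enriched): List[(String, String)] =
    e.contexts.map { case (schemaKey, json) =>
      schemaKey -> List(e.eventId, e.collectorTstamp, json).mkString("\t")
    }

  def main(args: Array[String]): Unit = {
    val event = Enriched(
      "e1",
      "2018-06-29 12:00:00",
      List("iglu:com.acme/link_click/jsonschema/1-0-0" -> """{"targetUrl":"https://example.com"}""")
    )
    shred(event).foreach { case (table, row) => println(s"$table: $row") }
  }
}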

> What is your suggestion?

Sorry, I'm not entirely sure yet. My most specific suggestion is in the previous comment. However, right now I'm totally fine with the boilerplate solution. It is super important for our use case to prevent constructing nonsense objects like the following:

Column("foo", redshiftType, postgresOptions)

And the current single hierarchy does not prevent it. That may not be a big deal for object construction, but it will be a huge problem for parsers.
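
To illustrate the parser concern with a made-up example (again, these names are not real schema-ddl code): with the dialect encoded in the type, a parser's signature itself documents and enforces which types it may produce.

object ParserSketch {
  sealed trait SqlDdl
  trait RedshiftDdl extends SqlDdl
  trait PostgresDdl extends SqlDdl

  trait SqlType[T <: SqlDdl]
  case object RsInteger extends SqlType[RedshiftDdl]
  case object PgJsonb extends SqlType[PostgresDdl]

  // The return type pins this parser to Redshift types only; with a
  // flat SqlType hierarchy nothing would stop it returning PgJsonb
  def parseRedshiftType(raw: String): Option[SqlType[RedshiftDdl]] =
    raw.trim.toUpperCase match {
      case "INTEGER" => Some(RsInteger)
      // case "JSONB" => Some(PgJsonb)  // would not compile here
      case _ => None
    }
}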

@szareiangm
Contributor Author

> Sure, please feel free to ask questions. In short: enriched events are our canonical TSV, containing all common fields (e.g. event_id, collector_tstamp, etc.) as well as self-describing JSONs (contexts, unstructured events). And Shredder is a Spark transformer job that transforms this heterogeneous TSV+JSON into something more friendly for a particular DB (in our case Redshift), like plain TSV.

Thanks for the kind explanation, invaluable and enlightening. I think that would be the next step in my plan. There is room for progress in every aspect of the codebase when we extend it to new data sinks. I will get back to you about shredding when I have read its code more thoroughly; for now, however, I need schemas generated out of iglu-server as the first step.

> Sorry, I'm not entirely sure yet. My most specific suggestion is in the previous comment. However, right now I'm totally fine with the boilerplate solution. It is super important for our use case to prevent constructing nonsense objects like the following:
> Column("foo", redshiftType, postgresOptions)
> And the current single hierarchy does not prevent it. That may not be a big deal for object construction, but it will be a huge problem for parsers.

I think that since Redshift is based on Postgres 8 with some commercial additions, the differences in types and options that we are worried about are potentially minimal. However, we can decide on solutions for the potential issues as we move forward. Let's move on and have a look at my PR for the first step, and then I can submit the PR for the second step. I appreciate your positive and open attitude.

@aldemirenes

Migrated to snowplow/iglu-server#13
