
Spark Scramble Appending #374

Open
commercial-hippie opened this issue May 27, 2019 · 8 comments

@commercial-hippie
Contributor

I haven't tested scramble creation using Spark, but appending to a scramble generates an invalid query.

The following error occurs:

mismatched input '`partner`' expecting {'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'} (line 1, pos 61)

It is caused by the following query (some columns stripped out to reduce the length):

insert into `verdict_scrambles`.`partitioned_flattened_orc` (`partner`,`src`,`ts`,`id`,`verdictdbtier`,`state_code`,`verdictdbblock`) select `partner`,`src`,`ts`,`id`,`verdictdbtier`,`state_code`,`verdictdbblock` from `verdict_scrambles`.`verdictdbtemp_dWH1w8rA`

This is because Spark doesn't support column lists in insert statements.

So instead of:

INSERT into tablename (col1, col2, col3) VALUES ('1', '2','3')

We need to use:

INSERT into tablename VALUES ('1', '2','3')
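
For reference, here is roughly what the failing statement above becomes once the column list is dropped (a minimal sketch via spark.sql; without a column list Spark maps the SELECT output to the target table's columns by position, so the SELECT list has to follow the table's column order):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Same append as above, but without the column list; Spark resolves the
# SELECT output against the target table positionally.
spark.sql("""
    INSERT INTO `verdict_scrambles`.`partitioned_flattened_orc`
    SELECT `partner`, `src`, `ts`, `id`, `verdictdbtier`, `state_code`, `verdictdbblock`
    FROM `verdict_scrambles`.`verdictdbtemp_dWH1w8rA`
""")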

I've currently hacked around it to get it working temporarily.


@pyongjoo pyongjoo self-assigned this May 27, 2019
@pyongjoo pyongjoo added the bug label May 27, 2019
@pyongjoo
Member

pyongjoo commented Jun 5, 2019

Thanks for letting us know. Right now, I'm leaning more toward a non-SQL approach for scramble creation. I found that Google Dataflow (backed by Apache Beam) could be a good solution for both generality and scalability. FYI, it can also easily work with files directly (as Dan suggested).
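
To give a rough idea (not a concrete design), a scramble-creation pipeline in Beam could look something like the sketch below. The bucket paths, the JSON-lines input format, and the block-assignment logic are all hypothetical; a real scramble would assign verdictdbtier/verdictdbblock according to the sampling design rather than uniformly at random.

import json
import random
import apache_beam as beam

NUM_BLOCKS = 100  # hypothetical number of scramble blocks

def assign_block(row):
    # Hypothetical assignment: the real logic would follow VerdictDB's
    # sampling design instead of a uniform random draw.
    row = dict(row)
    row['verdictdbtier'] = 0
    row['verdictdbblock'] = random.randrange(NUM_BLOCKS)
    return row

with beam.Pipeline() as pipeline:
    (pipeline
     | 'ReadSource'  >> beam.io.ReadFromText('gs://my-bucket/original_table/*.json')
     | 'ParseJson'   >> beam.Map(json.loads)
     | 'AssignBlock' >> beam.Map(assign_block)
     | 'ToJson'      >> beam.Map(json.dumps)
     | 'WriteOutput' >> beam.io.WriteToText('gs://my-bucket/scramble/part'))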

Let me know what you think.

@voycey

voycey commented Jun 6, 2019 via email

@pyongjoo
Member

pyongjoo commented Jun 6, 2019

For BQ, I think the Verdict team will use flat-rate pricing ($10K/month) in the future, so its users can basically run unlimited queries on unlimited-scale data. That said, this is still down the road.

@voycey

voycey commented Jun 6, 2019 via email

@pyongjoo
Member

pyongjoo commented Jun 8, 2019

Your datasets in BQ can be accessed by a third-party service if you grant access: https://cloud.google.com/bigquery/docs/share-access-views
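
For example, read access on a dataset can be granted programmatically; here is a sketch using the google-cloud-bigquery Python client (the project, dataset, and service-account names are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# Placeholder dataset; access entries are managed at the dataset level.
dataset = client.get_dataset("my-project.verdict_scrambles")

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="third-party-service@example.iam.gserviceaccount.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])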

I'm double-checking a billing-related question, i.e., which party is charged for queries.

@voycey

voycey commented Jun 12, 2019

I think for data provenance reasons this won't be possible - especially in light of the new privacy laws coming in from California.

We are happy to pay for the processing of our own data - we just need the tools to be able to do it :)

@pyongjoo
Member

@voycey It would be great if you could elaborate on the data provenance issue. My naive thought was that if Google allows the operation (e.g., viewing data without copying it), it must be legal (under the assumption that the data provider grants access).

Regarding partitioning, I think the easiest way is to let you specify partition columns for scrambles. Currently, scrambles inherit the partitioning of the original table (so the number of partitions explodes in combination with verdictdbblock). I believe a combination of (date, verdictdbblock) should be good enough for ensuring speed (without state).

In the future, we can reduce the total number of verdictdbblock values down to 10 or so by introducing a clever scheme, but it will take some time (since we are transitioning...)
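
For illustration only, a scramble partitioned on (date, verdictdbblock) could be declared along these lines (hypothetical table and column names; the actual DDL is generated by VerdictDB):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical DDL: partition the scramble on (date, verdictdbblock) only,
# instead of inheriting every partition column of the original table.
spark.sql("""
    CREATE TABLE `verdict_scrambles`.`orders_scramble` (
        `partner` STRING,
        `src`     STRING,
        `id`      STRING,
        `verdictdbtier` INT
    )
    PARTITIONED BY (`dt` DATE, `verdictdbblock` INT)
    STORED AS ORC
""")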

@voycey

voycey commented Jun 13, 2019

There are many use cases where this would be great; however, when dealing with user information, allowing third-party access to the data not only opens up a secondary attack vector but also technically means sharing information without notifying the users, which would breach the GDPR and the CCPA (I'm sure there are loads of other reasons why this wouldn't be allowed as well, but that is the first one that pops into my head).

I think if we can just handle the partitioning by date, that will be great for this - as the data is only 5% of the total, there is little need for a fully granular partitioning scheme.

In the meantime - generating scrambles (in whatever fashion) in BigQuery would be great - BQ simply handles the capacity requirements and they would probably complete very quickly!
