Support vector operations #26

Closed
akshaisarma opened this issue Apr 20, 2017 · 2 comments

Comments

@akshaisarma
Contributor

akshaisarma commented Apr 20, 2017

It is not convenient to boil your test data down into a single aggregate. If you have N dimensions and M metrics, in the Hive execution engine, you have to write a query for each of the M metrics and, in each query, produce the metric for all N dimensions boiled down to a single row.

It is far more user-friendly if the user could provide a CSV representation of the expected data:

Dim1, Met1, Met2
N1,   M1,   M2,...
N2,   M1,   M2,...
N3,   M1,   M2,...
...

This could be a datasource that just reads a CSV file and turns it into a tabular format, as any datasource currently does in Validatar.
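
A minimal sketch of such a datasource, written in Python only for brevity. `read_csv_table` is a hypothetical name and not part of Validatar, the sample values are made up, and the column-oriented layout is an assumption about what "tabular format" means here. It assumes the first line is a header and keeps every value as a string:

```python
import csv
import io


def read_csv_table(text):
    """Parse CSV text into a column-oriented table: {column name: [values]}.

    Hypothetical helper for illustration; not Validatar's API.
    """
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = [name.strip() for name in next(reader)]
    table = {name: [] for name in header}
    for row in reader:
        if not row:  # csv.reader yields an empty list for blank lines
            continue
        if len(row) != len(header):
            raise ValueError(f"expected {len(header)} fields, got {len(row)}: {row}")
        for name, value in zip(header, row):
            table[name].append(value.strip())
    return table


# Made-up sample data in the shape shown above.
expected = read_csv_table("Dim1, Met1, Met2\nN1, 10, 0.5\nN2, 20, 0.7\n")
print(expected)
```

Type conversion (strings to numbers) is left out on purpose; a real datasource would need some rule for it.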

Then, in their asserts (assuming the expected data is in a query called R and the data being tested is in a query called Q), they could simply write:

  • Q.Met1 < R.Met1 on Q.Dim1 == R.Dim1
  • (approx(Q.Met2, R.Met2, 0.02) && ... ) on (Q.Dim1 == R.Dim1 || R.Dim1 == "N3")

The first assert checks that for all rows where R.Dim1 = Q.Dim1, Q.Met1 is less than R.Met1. The second one asserts that Q.Met2 is within 2% of R.Met2 for all rows where Q.Dim1 = R.Dim1 or R.Dim1 is N3.
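
To make the intended semantics concrete, here is a rough Python sketch. The names `assert_where` and `approx` as defined here are hypothetical, the relative-tolerance reading of `approx` is an assumption, and the rows are made-up data. It pairs every row of Q with every row of R, keeps the pairs that satisfy the condition, and reports the ones that fail the check:

```python
from itertools import product


def approx(actual, expected, tolerance):
    """True if actual is within `tolerance` (a fraction) of expected.

    Assumed semantics for illustration only.
    """
    return abs(actual - expected) <= tolerance * abs(expected)


def assert_where(q_rows, r_rows, check, where):
    """Return the (q, r) row pairs that satisfy `where` but fail `check`."""
    return [(q, r) for q, r in product(q_rows, r_rows)
            if where(q, r) and not check(q, r)]


# Made-up rows: Q is the data under test, R is the expected data.
Q = [{"Dim1": "N1", "Met1": 5, "Met2": 1.01},
     {"Dim1": "N2", "Met1": 9, "Met2": 2.5}]
R = [{"Dim1": "N1", "Met1": 6, "Met2": 1.00},
     {"Dim1": "N2", "Met1": 8, "Met2": 2.5}]

# Q.Met1 < R.Met1 on Q.Dim1 == R.Dim1
failures = assert_where(
    Q, R,
    check=lambda q, r: q["Met1"] < r["Met1"],
    where=lambda q, r: q["Dim1"] == r["Dim1"],
)
print(failures)  # only the N2 pair fails, since 9 is not less than 8
```

Note that with this reading an assert whose condition matches no rows passes vacuously; whether that should be an error is a separate design question.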

I thought about how we would implement this and it seems relatively easy. As for parsing the new assertion syntax, it's the same as our grammar right now, just separated with a new 'on' keyword, so we just need to add a new grammar level. We will need to relax our assertion framework, which currently forces everything to one row.

If you don't provide an 'on' keyword, we will behave as we do now and force to one row. This has the nice benefit that nothing changes for existing test files already written.
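
The fallback can be sketched as follows, in Python for brevity. `split_assertion` is a hypothetical helper; the real change would be a new rule in the grammar rather than a regex, and this version ignores keywords inside string literals:

```python
import re


def split_assertion(text, keyword="on"):
    """Split 'check <keyword> condition' into (check, condition).

    condition is None when the keyword is absent, which signals the
    existing single-row behavior. Simplified: splits on the first
    whitespace-delimited keyword and does not understand string literals.
    """
    pattern = r"\s+" + re.escape(keyword) + r"\s+"
    parts = re.split(pattern, text.strip(), maxsplit=1)
    check = parts[0]
    condition = parts[1] if len(parts) == 2 else None
    return check, condition


print(split_assertion("Q.Met1 < R.Met1 on Q.Dim1 == R.Dim1"))
print(split_assertion("Q.count > 0"))  # no keyword: condition is None
```

Because the condition is optional, existing assertions parse exactly as before.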

The motivation for this comes from some users who contacted us about how they would keep their dimensions and metrics in check as they grow over time.

Thoughts? Comments?

@joshwalters
Contributor

So this would basically be performing joins on a list of expected values? Won't you have to know the column names for the Q CSV data set? It seems like the on is really a where: apply the given check where some logic holds true, correct?

This seems like a complicated way to get around returning a single record and performing a check. I don't fully understand why it is not convenient to return a single aggregate record (or a single raw record) for a test. I guess I could see some usefulness in the following scenario:

Table t has columns country, agg1, agg2. You want to enforce something like agg1 < agg2 for some values of country, but not all. You could do a group by query to make a list of values, and then do the check this way.

Still, all this could be done with the previous setup. The tests would be longer but simpler. It really comes down to simple but verbose vs. complex but terse. I would side with simpler, but if someone really wants it I don't particularly mind.

@akshaisarma
Contributor Author

akshaisarma commented Apr 22, 2017

Yeah, where is better than on; I was just using the SQL terminology when you do a JOIN.

The problem is not just that you want to do agg1 < agg2 but also that you want to change how that relationship holds based on differing country values (maybe with a changing threshold %). When you begin enumerating your queries this way for 10-20 different countries and 10-20 different aggs, you'll create one query for each agg, each with particular cells for each country's agg and the threshold baked into the query, so each query will produce a long row with each country's agg. If one country changes, disappears or is added, you will have to change all 10-20 queries. You could shove it all into a single query, but that essentially just flattens the expected data into a single long row of 100-400 cells.
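
As a rough illustration of the alternative (Python, made-up data): if the expected data carries the threshold per row, one check covers every country, and adding or removing a country is a one-row change. The `threshold` column and the `failing_countries` helper are assumptions for this sketch, not something Validatar provides:

```python
def approx(actual, expected, tolerance):
    """True if actual is within `tolerance` (a fraction) of expected (assumed semantics)."""
    return abs(actual - expected) <= tolerance * abs(expected)


# Expected data with a per-country threshold (hypothetical column, made-up values).
expected = [
    {"country": "US", "agg1": 100.0, "threshold": 0.02},
    {"country": "IN", "agg1": 250.0, "threshold": 0.05},
]
actual = [
    {"country": "US", "agg1": 101.0},
    {"country": "IN", "agg1": 270.0},
]


def failing_countries(actual, expected):
    """Countries whose agg1 falls outside that country's own threshold."""
    by_country = {row["country"]: row for row in expected}
    failed = []
    for row in actual:
        want = by_country.get(row["country"])
        if want is None:  # no expectation for this country: nothing to check
            continue
        if not approx(row["agg1"], want["agg1"], want["threshold"]):
            failed.append(row["country"])
    return failed


print(failing_countries(actual, expected))  # ['IN']: 270 is more than 5% away from 250
```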

It could all still be done, but it just gets really verbose, error-prone and hard to understand. At the end of the day, if Validatar is meant to simplify your functional testing, I am of the opinion that taking on some of the annoyance and making it somewhat simpler to express, read and share could be really valuable to users. I'll give it a shot and see if I can't keep the implementation simple.

Thanks for the feedback!

@akshaisarma akshaisarma added this to the 0.5.0 milestone Jun 22, 2017