Support vector operations #26

akshaisarma · 2017-04-20T23:59:26Z

It is not convenient to boil your test data down into a single aggregate. If you have N dimensions and M metrics, in the Hive execution engine, you have to write a query for each of the M metrics and, in each query, produce the metric for N dimensions boiled down to a single row.

It is far more user-friendly if the user could provide a CSV representation of the expected data:

Dim1, Met1, Met2
N1,   M1,   M2,...
N2,   M1,   M2,...
N3,   M1,   M2,...
...

This could be a datasource that just reads a CSV file and turns it into a tabular format, as any datasource currently does in Validatar.
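
As a rough illustration of what "reads a CSV file and turns it into a tabular format" could mean, here is a minimal Python sketch. It is not Validatar code: `load_csv_rows` and the sample values are hypothetical, and values are left as strings because the proposal does not say how types would be inferred.

```python
import csv
import io

def load_csv_rows(text):
    """Parse CSV text into a list of dicts, one per data row (hypothetical helper)."""
    # skipinitialspace drops the space after each comma, as in "Dim1, Met1, Met2".
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [dict(row) for row in reader]

# Sample expected data; the numbers are made up for illustration.
SAMPLE = "Dim1, Met1, Met2\nN1, 10, 101.0\nN2, 8, 50.0\n"
rows = load_csv_rows(SAMPLE)
```

Each row keeps its column names, which is what the asserts would need in order to refer to R.Dim1 or R.Met1.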

And in their asserts (assuming their expected data is in a query called R and the data being tested is in a query called Q), they could simply write:

Q.Met1 < R.Met1 on Q.Dim1 == R.Dim1
(approx(Q.Met2, R.Met2, 0.02) && ... ) on (Q.Dim1 == R.Dim1 || R.Dim1 == "N3")

The first assert checks that for all rows where R.Dim1 = Q.Dim1, Q.Met1 is less than R.Met1. The second one asserts that Q.Met2 is within 2% of R.Met2 for all rows where Q.Dim1 = R.Dim1 or R.Dim1 is N3.
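
The semantics of the two asserts above can be sketched in Python as a filter over pairs of rows. This is only an illustration under assumptions: `check` and this definition of `approx` (relative difference within a fraction of the expected value) are hypothetical, and the sample rows are invented.

```python
from itertools import product

def approx(actual, expected, tolerance):
    """Assumed meaning: actual is within `tolerance` (a fraction) of expected."""
    return abs(actual - expected) <= tolerance * abs(expected)

def check(q_rows, r_rows, predicate, join):
    """Return every (q, r) pair that matches `join` but fails `predicate`."""
    return [
        (q, r)
        for q, r in product(q_rows, r_rows)
        if join(q, r) and not predicate(q, r)
    ]

# Q is the data under test, R is the expected data (values are made up).
Q = [
    {"Dim1": "N1", "Met1": 5, "Met2": 100.0},
    {"Dim1": "N2", "Met1": 9, "Met2": 50.5},
]
R = [
    {"Dim1": "N1", "Met1": 10, "Met2": 101.0},
    {"Dim1": "N2", "Met1": 8, "Met2": 50.0},
    {"Dim1": "N3", "Met1": 7, "Met2": 100.0},
]

# Q.Met1 < R.Met1 on Q.Dim1 == R.Dim1
met1_failures = check(
    Q, R,
    predicate=lambda q, r: q["Met1"] < r["Met1"],
    join=lambda q, r: q["Dim1"] == r["Dim1"],
)

# approx(Q.Met2, R.Met2, 0.02) on (Q.Dim1 == R.Dim1 || R.Dim1 == "N3")
met2_failures = check(
    Q, R,
    predicate=lambda q, r: approx(q["Met2"], r["Met2"], 0.02),
    join=lambda q, r: q["Dim1"] == r["Dim1"] or r["Dim1"] == "N3",
)
```

With this sample data each assert fails for exactly one pair of rows: N2 in the first (9 is not less than 8), and Q's N2 against R's N3 in the second. An assert passes when its list of failures is empty.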

I thought about how we would implement this and it seems relatively easy. Parsing the new assertion syntax is the same as our grammar right now, just separated by a new 'on' keyword, so we only need to add a new grammar level. We will also need to relax our assertion framework, which currently forces everything to one row.

If you don't provide an 'on' keyword, we will behave as we do now and force to one row. This has the nice benefit that nothing changes for existing test files already written.
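
The backward-compatible behaviour described above can be sketched as follows. This is a string-level illustration only, not the actual grammar change: `split_assertion` is a hypothetical helper, and a real implementation would handle the keyword in the parser so that a string literal containing " on " is not split by mistake.

```python
import re

# Whitespace-delimited 'on' keyword separating the check from the row condition.
_ON = re.compile(r"\s+on\s+")

def split_assertion(text):
    """Split an assertion into (check, condition); condition is None without 'on'."""
    parts = _ON.split(text.strip(), maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1]
    # No 'on' keyword: keep the existing single-row behaviour.
    return parts[0], None

vector_form = split_assertion("Q.Met1 < R.Met1 on Q.Dim1 == R.Dim1")
legacy_form = split_assertion("Q.count > 0")
```

An assertion without the keyword yields `None` for the condition, which is the signal to evaluate it against a single row exactly as existing test files expect.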

The motivation for this comes from some users who contacted us about how they would keep their dimensions and metrics in check as they grow over time.

Thoughts? Comments?


joshwalters · 2017-04-21T20:12:41Z

So this would basically be performing joins on a list of expected values? Won't you have to know the column names for the Q CSV data set? It seems like the 'on' is really a 'where': apply the given check where some logic holds true, correct?

This seems like a complicated way to get around returning a single record and performing a check. I don't fully understand why it is not convenient to return a single aggregate record (or a single raw record) for a test. I guess I could see some usefulness in the following scenario:

Table t has columns country, agg1, agg2. You want to enforce something like agg1 < agg2 for some values of country, but not all. You could do a group by query to make a list of values, and then do the check this way.

Still, all this could be done with the previous setup. The tests would be longer but simpler. It really comes down to simple but verbose vs. complex but terse. I would side with simpler, but if someone really wants it I don't particularly mind.

akshaisarma · 2017-04-22T00:35:27Z

Yeah, where is better than on; I was just using the SQL terminology when you do a JOIN.

The problem is not just that you want to do agg1 < agg2 but also that you want to change how that relationship holds based on differing country values (maybe with a changing threshold %). When you begin enumerating your queries this way for 10-20 different countries and 10-20 different aggs, you'll create 10-20 queries, one per agg, with each query having particular cells for each country's agg and the threshold baked into the query. Each query will produce a long row with each country's agg. If one country changes, disappears or is added, you will have to change all 10-20 queries. You could shove it all into a single query, but that essentially just flattens the expected data into a single long row of 100 to 400 cells.

It could all still be done, but it gets really verbose, error-prone and hard to understand. At the end of the day, if Validatar is meant to simplify your functional testing, I am of the opinion that taking on some of the annoyance and making it somewhat simpler to express, read and share could be really valuable to users. I'll give it a shot and see if I can keep the implementation simple.

Thanks for the feedback!

akshaisarma added enhancement question labels Apr 20, 2017

akshaisarma self-assigned this Apr 20, 2017

akshaisarma mentioned this issue Jun 22, 2017

Support static datasources: CSV #27

Closed

akshaisarma removed the question label Jun 22, 2017

akshaisarma added this to the 0.5.0 milestone Jun 22, 2017

akshaisarma closed this as completed in f13a875 Jul 18, 2017
