Better error handling for errors in the execute() stage #191

ropeladder · 2022-09-02T01:48:04Z

Is your feature request related to a problem? Please describe.
When my code runs into errors in the execute() stage, the traceback just goes to the execute() function, and doesn't provide any useful feedback about the provenance of the error. For example, if my DAG feeds a series into another series and they have incompatible indexes, I get an index error but I can't tell which series caused.

Describe the solution you'd like
Just passing back the names of the relevant series with the error would helpful.

Describe alternatives you've considered
More involved type checking when constructing the series functions and then when building the DAG could potentially eliminate some of the errors (probably not all though).

The text was updated successfully, but these errors were encountered:

elijahbenizzy · 2022-09-03T15:35:09Z

So I think that check_output will be really helpful here.

But I see a few sets of problems that we can solve:

Making it easy to debug/isolate problems that come up during execution
Ensuring compatibilities between nodes, more specific than just "these are both series"
Ensuring that the indices match/node values are compatible at final results building.

I think (1) is mechanical -- easy enough to add better debugging messages, catching errors along the way. (2) is probably done best with check_output, and (3) could be done in the resuls_builder.

Any chance you have an example of the type of unclear message you're referencing here?

skrawcz · 2022-09-04T20:20:18Z

@ropeladder to be test driven about this what other tests would be relevant to add:

Should fail & have clear error messages:

creating a dataframe from two or more series with different index types. E.g. date index vs integer index.

Should not fail -- but maybe should warn?

when index ranges don't overlap.

Anything else?

ropeladder · 2022-09-05T22:31:58Z

@skrawcz: These probably fall under different index types, but other specific things to check for would be:

DatetimeIndex vs PeriodIndex
different frequencies

@elijahbenizzy: On (3) it's definitely worth clarifying this in the code and in the docs. I was blissfully trying to output series of different frequencies until it occurred to me that the output was all going into the same dataframe. At which point I just separated my execute() calls into separate ones for each frequency, which feels a bit hacky.

I'll try and post some examples tomorrow.

skrawcz · 2022-09-06T06:57:22Z

On (3) it's definitely worth clarifying this in the code and in the docs. I was blissfully trying to output series of different frequencies until it occurred to me that the output was all going into the same dataframe. At which point I just separated my execute() calls into separate ones for each frequency, which feels a bit hacky.

Interesting. 🤔 Just to clarify do you want one dataframe (with a unified index?) Or separate dataframes?

The Hamilton default assumes you want to create a single dataframe when you do .execute() but as @elijahbenizzy says, can be changed with our "result builder" abstraction. Otherwise do you have concerns running .execute() twice with different columns to create two dataframes?

ropeladder · 2022-09-06T19:19:26Z

The output is workable as is, and I think using tags it could be a bit cleaner than what I've been doing. On the other hand, having .execute() automatically separate output by index and return everything cleanly in multiple dataframes, or maybe having an option to coerce/resample to a specific database would be useful, instead of throwing an error or just mushing things together.

Error example: (Note: it only seems to happen intermittently)
(Also note: when it does work it's a good example of where warnings might be helpful, e.g. adding two series with different indexes may just give a new series filled with NaNs)

import importlib
from hamilton import driver
import pandas as pd

 
initial_columns = {
	"series1": pd.Series([1,2,3], index=pd.DatetimeIndex(['2022-01','2022-02','2022-03'], freq='MS')),
	"series2": pd.Series([4,5,6], index=pd.PeriodIndex(year=[2022, 2022, 2022], month=[1,2,3], freq='M')),
	"series3": pd.Series([4,5,6], index=pd.PeriodIndex(year=[2022, 2022, 2022], month=[1,1,1], day=[3,4,5], freq='B')),
	"series4": pd.Series([4,5,6], index=pd.PeriodIndex(year=[2022, 2022, 2022], month=[1,1,1], day=[4,11,18], freq='W')),
}

module_name = 'funs_ex'
module = importlib.import_module(module_name)

dr = driver.Driver(initial_columns, module)

output_columns = {'series1', 'series2', 'series3', 'series4', 'series5'}

df = dr.execute(output_columns) 

print(df['series1'])
print(df['series2'])
print(df['series3'])
print(df['series4'])
print(df['series5'])

with:

import pandas as pd

def series5(series1: pd.Series, series2: pd.Series) -> pd.Series:
	# return pd.concat([series1, series2])
    return series1 + series2

def series6(series1: pd.Series, series3: pd.Series) -> pd.Series:
	# return pd.concat([series1, series2])
    return series1.fillna(series3.resample('BM').ffill())

Errors: (when it doesn't run)

(traceback to `.execute()` with no info about which columns actually cause the error)
pandas._libs.tslibs.period.IncompatibleFrequency: Input has different freq=M from Period(freq=W-SUN)

Warnings: (when it does run it still gives this warning)

.../python3.9/site-packages/hamilton/base.py:52: RuntimeWarning: '<' not supported between instances of 'Period' and 'Timestamp', sort order is undefined for incomparable objects.
  return pd.DataFrame(outputs)

elijahbenizzy · 2022-09-07T20:32:10Z

One thing to consider (not sure if you've seen this) is raw_execute, which doesn't do a join. Under the hood, execute just calls raw_execute and does a join for you. So, you can use that, then join in a custom manner (in multiple dataframes as you desire). Its an easier way to customize than building an entire ResultsBuilder.

skrawcz · 2022-09-09T12:29:21Z

Thanks for the tests.

So in your example, naively creating a dataframe from the initial columns does work but the result isn't useful:

df = pd.DataFrame(initial_columns)

outputs:

<input>:1: RuntimeWarning: '<' not supported between instances of 'Period' and 'Timestamp', sort order is undefined for incomparable objects.
                       series1  series2  series3  series4
2022-01-01 00:00:00        1.0      NaN      NaN      NaN
2022-02-01 00:00:00        2.0      NaN      NaN      NaN
2022-03-01 00:00:00        3.0      NaN      NaN      NaN
2022-01                    NaN      4.0      NaN      NaN
2022-02                    NaN      5.0      NaN      NaN
2022-03                    NaN      6.0      NaN      NaN
2022-01-03                 NaN      NaN      4.0      NaN
2022-01-04                 NaN      NaN      5.0      NaN
2022-01-05                 NaN      NaN      6.0      NaN
2022-01-03/2022-01-09      NaN      NaN      NaN      4.0
2022-01-10/2022-01-16      NaN      NaN      NaN      5.0
2022-01-17/2022-01-23      NaN      NaN      NaN      6.0

What should happen here? Adding a warning here is pretty easy as is erroring.

I guess the question is, do we trust Panda's index coalescing? Since it does seem to work (e.g. int -> float), but for time related ones, less so as in this example...

ropeladder · 2022-09-15T15:28:08Z

Maybe a warning that listed all the index "groups" (e.g. series indices that had any overlap with over series indices). I could also see an option to output different index "groups" as separate dataframes being helpful.

This is partly why I mentioned more elaborate type checking initially as an alternative -- because you'd be forced to be more explicit about what indices you expect, so e.g. hamilton could tell you straightaway that you're trying to output a dataframe with incompatible indices.

The way I'm currently using hamilton is to try converting some legacy code from a different language, that has outputs that use a bunch of different frequencies. I'm finding that instead of just going through and creating one output where I can get any of the values that I want via column name, I have to divide them into subgroups based on their index types (and possibly date ranges, where joining the columns into a DataFrame adds in NaNs that I don't want).

In other words, raw_execute() is the more intuitive functionality for me. The joining that execute() does feels like it adds a lot of overhead I'm not expecting, since hamilton is generally focused on columns.

ropeladder · 2022-09-15T15:37:32Z

Also: I just ran into another example of an error when (unthinkingly) trying to execute() and output both standard pd.Series and a pd.index into the final dataframe:

def series1() -> pd.Series:
    return pd.Series([1,2,3])

def series2() -> pd.PeriodIndex:
    return pd.Series([4,5]).index

ValueError: array length <x> does not match index length <y>

(The fact that it errors is fine, it's just that it doesn't say anything about the columns that cause it and the traceback just goes to execute())

skrawcz · 2022-09-20T00:19:09Z

Okay my goal is to have something by the end of this week barring no major disruptions that will minimally:

Warn if there are index type mismatches.
Require you to set your logger to debug if you want to see more details.
Provide a "ResultBuilder" class that uses strict index type matching so if you want to error on index type mismatches, this is the results builder to use.

Related to issue #191, this commit is to help surface index type issues. Specifically: 1. Warn if there are index type mismatches. 2. Require you to set your logger to debug if you want to see more details. 3. Provide a "ResultBuilder" class that uses strict index type matching so if you want to error on index type mismatches, this is the results builder to use. I don't think we should build anything more custom unless there's a clear common use case - user contributed result builders sound like an interesting idea.

skrawcz · 2022-10-28T21:47:23Z

@ropeladder I'm going to close this since with 1.11.0 we now provide warnings and the option to use a strict index dataframe builder. Please re-open if there's more we could do.

skrawcz added the enhancement New feature or request label Sep 2, 2022

skrawcz self-assigned this Sep 20, 2022

skrawcz mentioned this issue Sep 22, 2022

Add pandas index checks #200

Merged

14 tasks

skrawcz closed this as completed Oct 28, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better error handling for errors in the execute() stage #191

Better error handling for errors in the execute() stage #191

ropeladder commented Sep 2, 2022

elijahbenizzy commented Sep 3, 2022

skrawcz commented Sep 4, 2022 •

edited

ropeladder commented Sep 5, 2022

skrawcz commented Sep 6, 2022

ropeladder commented Sep 6, 2022

elijahbenizzy commented Sep 7, 2022

skrawcz commented Sep 9, 2022

ropeladder commented Sep 15, 2022

ropeladder commented Sep 15, 2022

skrawcz commented Sep 20, 2022

skrawcz commented Oct 28, 2022

Better error handling for errors in the execute() stage #191

Better error handling for errors in the execute() stage #191

Comments

ropeladder commented Sep 2, 2022

elijahbenizzy commented Sep 3, 2022

skrawcz commented Sep 4, 2022 • edited

ropeladder commented Sep 5, 2022

skrawcz commented Sep 6, 2022

ropeladder commented Sep 6, 2022

elijahbenizzy commented Sep 7, 2022

skrawcz commented Sep 9, 2022

ropeladder commented Sep 15, 2022

ropeladder commented Sep 15, 2022

skrawcz commented Sep 20, 2022

skrawcz commented Oct 28, 2022

skrawcz commented Sep 4, 2022 •

edited