Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Multi model columns #76

Open
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

smasoka
Copy link
Contributor

@smasoka smasoka commented Oct 30, 2020

Draft PR to handle multi-column expressions.

@ratt-priv-ci
Copy link
Collaborator

Can one of the admins verify this patch?

@bennahugo
Copy link
Collaborator

Hi can we please not rely on a branch in dependencies. Either a point release is needed or a specific commit checksum is needed

@sjperkins
Copy link
Member

Hi can we please not rely on a branch in dependencies. Either a point release is needed or a specific commit checksum is needed

Yep, the intention is to try things out in this branch until we're happy and then release dask-ms

@bennahugo bennahugo added the wip Work in progress label Oct 30, 2020
@bennahugo bennahugo changed the title Multi model columns WIP: Multi model columns Oct 30, 2020
@bennahugo bennahugo marked this pull request as draft October 30, 2020 11:47
@sjperkins
Copy link
Member

sjperkins commented Nov 2, 2020

I think the changes are more complicated than they need to be because they implement some of the expression parsing behaviour in data_colum_expr. In particular it will break on residual computation i.e.: DATA - MODEL_DATA.

In general I think the change could be simplified to the following general idea:

parts = args.data_column.split("=")

if parts == 1:
   flag_data = parts[0]
else:
  data_col = parts[0]
  nds = data_column_expr(args.data_column, datasets)
  flag_data = getattr(nds, data_col)

Edit: Corrections

@bennahugo
Copy link
Collaborator

Instead of reimplementing a parse tree that obeys botmath rules to evaluate this why not use numexpr with a blockwise map on the data? Parse out the column names between the operators using regex as you have now and assign them to variables then map numexpr.evaluate over the arrays. Assumption that the arrays have the same shape can be made

In [1]: import numexpr

In [2]: numexpr.evaluate("5+2")
Out[2]: array(7, dtype=int32)

In [3]: import numpy as np

In [4]: a = np.ones((100,64,4))

In [5]: b = np.ones((100,64,4))

In [6]: numexpr.evaluate("a+b")

@sjperkins
Copy link
Member

Instead of reimplementing a parse tree that obeys botmath rules to evaluate this why not use numexpr with a blockwise map on the data? Parse out the column names between the operators using regex as you have now and assign them to variables then map numexpr.evaluate over the arrays. Assumption that the arrays have the same shape can be made

In [1]: import numexpr

In [2]: numexpr.evaluate("5+2")
Out[2]: array(7, dtype=int32)

In [3]: import numpy as np

In [4]: a = np.ones((100,64,4))

In [5]: b = np.ones((100,64,4))

In [6]: numexpr.evaluate("a+b")

Well the parse tree is already implemented in plain python -- no need to add an extra dependency which unfortunately only has one maintainer.

@sjperkins
Copy link
Member

sjperkins commented Nov 2, 2020

Also, I should add that the parse tree evaluates bodmas expressions involving dask arrays

@bennahugo
Copy link
Collaborator

bennahugo commented Nov 2, 2020

ok but as I see it it evaluates a limited subset of expressions, for example it is assumed that there is one column in front of the division operator, so what if I have corrected residuals that I would want to flag a quotient on?
Then I need to specify to add back the sources
(CORRECTED_DATA + DE_MODEL1 ) / MODEL_DATA

Just an example though - in this case the data was calibrated to MODEL_DATA - DE_MODEL1:DE_MODEL1 as the DI and DD directions.

@sjperkins
Copy link
Member

ok but as I see it it evaluates a limited subset of expressions, for example it is assumed that there is one column in front of the division operator, so what if I have corrected residuals that I would want to flag a quotient on?

Currently yes, the PR needs to change to depend more fully on data_column_expr.

Then I need to specify to add back the sources
(CORRECTED_DATA + DE_MODEL1 ) / MODEL_DATA

data_column_expr should handle this: https://dask-ms.readthedocs.io/en/latest/api.html#daskms.expressions.data_column_expr

Just an example though - in this case the data was calibrated to MODEL_DATA - DE_MODEL1:DE_MODEL1 as the DI and DD directions.

@smasoka
Copy link
Contributor Author

smasoka commented Nov 2, 2020

The regex split is mainly for

xds = list(xds_from_ms(args.ms,
                           columns=tuple(columns),
                           ...................

which is faster than reading the entire ms, versus splitting for data_column_expr which handles the expression well.
I think a good re.split can handle this and support cases such as DATA-MODEL and DATA / (DIR1_DATA + DIR2_DATA + MODEL_DATA) at once.

@sjperkins
Copy link
Member

The regex split is mainly for

xds = list(xds_from_ms(args.ms,
                           columns=tuple(columns),
                           ...................

Ah I see, you need to read the MS to check whether the columns in the expression exist?

which is faster than reading the entire ms, versus splitting for data_column_expr which handles the expression well.
I think a good re.split can handle this and support cases such as DATA-MODEL and DATA / (DIR1_DATA + DIR2_DATA + MODEL_DATA) at once.

Just to be clear, because I think there's confusion here. data_column_expr should be able to handle virtually any simple calculator expression (BODMAS) involving data columns, so there shouldn't be any need for regex in the tricolour argument handling.

The reason for the split is to distinguish between the two --data-column cases:

1.It contains an expression (detectable by the presence of "=" in the string), in which case args.data_column can be used as input to data_column_expr.
2. a standard data column.

@sjperkins
Copy link
Member

@bennahugo Regarding expressions containing division operations, what should happen in the case of zeroed/inf/nan visibilities?

@bennahugo
Copy link
Collaborator

There is an explicit flagging step for this

@bennahugo
Copy link
Collaborator

(Those visibilities are in error)

2. extract visibility column name to support substract_model_column
3. removed columns arg from xds_from_ms to read entire ms
4. add warning for -smc arg depreciation
@smasoka smasoka marked this pull request as ready for review November 13, 2020 10:36
tricolour/apps/tricolour/app.py Outdated Show resolved Hide resolved
tricolour/apps/tricolour/app.py Outdated Show resolved Hide resolved
string_columns = None
# extract the name of the visibility column
# to support subtract_model_column
data_column = re.split(r'\+|-|/|\*|\(|\)', args.data_column)[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a regex designed to detect the kind of expressions that https://github.com/ska-sa/dask-ms/blob/master/daskms/expressions.py#L103 supports. I think it's probably better to just split on the equals as follows as its easer to separate into the two cases

parts = args.data_column.split("=")
log.info("Flagging on the {0:s} {0:s}".format(parts[0], "column" if len(parts) == 1 else "expression))

Additionally, I don't think the regex contained an equals sign so I think it would have failed on an assignment statement.

Could you please correct and test the two cases?

  1. The simple case where data_column just references a column name
  2. The complicated case where data_column contains an assignment.

tricolour/apps/tricolour/app.py Outdated Show resolved Hide resolved
@sjperkins
Copy link
Member

Is this ready for review? The CI seems to be failing.

@smasoka
Copy link
Contributor Author

smasoka commented Jan 8, 2021

@IanHeywood do you mind trying this PR branch and give your initial feedback?

@IanHeywood
Copy link
Collaborator

I installed the multi-model_columns branch. I can't get it to work, but maybe I just don't understand the syntax required.

tricolour - 2021-01-15 23:15:29,339 INFO - Adding field 'CDFS_2_4' scan 3 to compute graph for processing
Unexpected error. Dropping you into pdb for a post-mortem.
Traceback (most recent call last):
  File "/mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/dataset.py", line 381, in __getattr__
    return self._attrs[name]
KeyError: 'DATA-MODEL_DATA'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/home/ianh/venv/tricolour_multicol/bin/tricolour", line 11, in <module>
    load_entry_point('tricolour', 'console_scripts', 'tricolour')()
  File "/mnt/home/ianh/Software/tricolour_multicol/tricolour/apps/tricolour/app.py", line 261, in main
    _main(args)
  File "/mnt/home/ianh/Software/tricolour_multicol/tricolour/apps/tricolour/app.py", line 373, in _main
    nrow, nchan, ncorr = getattr(ds, data_column).data.shape
  File "/mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/dataset.py", line 383, in __getattr__
    raise AttributeError("Invalid Attribute %s" % name)
AttributeError: Invalid Attribute DATA-MODEL_DATA
> /mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/dataset.py(383)__getattr__()
-> raise AttributeError("Invalid Attribute %s" % name)
(Pdb)

I tried -dc and --data-column, with and without the = sign, and also + instead of -.

@smasoka
Copy link
Contributor Author

smasoka commented Jan 18, 2021

-dc arg is based on daskms expression. So something like this -dc "FLAG_DATA = DATA - MODEL_DATA" should work for expressions and simply -dc DATA for a column.

@smasoka smasoka mentioned this pull request Jan 26, 2021
@sjperkins
Copy link
Member

Mental note, we mustn't merge this before:

  • A new dask-ms release is put out with the data column expression functionality in it
  • The new dask-ms master branch is replaced with the new dask-ms version in this PR

@IanHeywood
Copy link
Collaborator

Trying this out again. Using quotations marks as per the example and docs doesn't work, e.g.

-dc "FLAG_DATA = DATA - MODEL_DATA"

gives:

Unexpected error. Dropping you into pdb for a post-mortem.
Traceback (most recent call last):
  File "/mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/dataset.py", line 381, in __getattr__
    return self._attrs[name]
KeyError: 'FLAG_DATA '

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/home/ianh/venv/tricolour_multicol/bin/tricolour", line 11, in <module>
    load_entry_point('tricolour', 'console_scripts', 'tricolour')()
  File "/mnt/home/ianh/Software/tricolour_multicol/tricolour/apps/tricolour/app.py", line 261, in main
    _main(args)
  File "/mnt/home/ianh/Software/tricolour_multicol/tricolour/apps/tricolour/app.py", line 373, in _main
    nrow, nchan, ncorr = getattr(ds, data_column).data.shape
  File "/mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/dataset.py", line 383, in __getattr__
    raise AttributeError("Invalid Attribute %s" % name)
AttributeError: Invalid Attribute FLAG_DATA
> /mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/dataset.py(383)__getattr__()
-> raise AttributeError("Invalid Attribute %s" % name)

If I provide an expression with no spaces and no quotations marks, e.g.:

-dc FLAG_DATA=DATA-MODEL_DATA

starts flagging. Although I can't vouch for what's going on under the hood.

Does FLAG_DATA end up being a new column in the MS, or is that just an intermediate product that needs to be given a name and is discarded once the flags are written?

I'm more confident when doing this sort of thing if I can use parentheses to be sure of the ordering, as is also shown in the dask documentation example. But bash doesn't like this unless I use quotes, e.g.:

-dc "FLAG=DATA-(MODEL_DATA-DIR1_DATA)"

at which point this happens:

Unexpected error. Dropping you into pdb for a post-mortem.
Traceback (most recent call last):
  File "/mnt/home/ianh/venv/tricolour_multicol/bin/tricolour", line 11, in <module>
    load_entry_point('tricolour', 'console_scripts', 'tricolour')()
  File "/mnt/home/ianh/Software/tricolour_multicol/tricolour/apps/tricolour/app.py", line 261, in main
    _main(args)
  File "/mnt/home/ianh/Software/tricolour_multicol/tricolour/apps/tricolour/app.py", line 297, in _main
    xds = data_column_expr(args.data_column, xds)
  File "/mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/expressions.py", line 140, in data_column_expr
    new_datasets = v.visit(expr)
  File "/usr/lib/python3.6/ast.py", line 253, in visit
    return visitor(node)
  File "/mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/expressions.py", line 33, in visit_Assign
    values = self.visit(node.value)
  File "/usr/lib/python3.6/ast.py", line 253, in visit
    return visitor(node)
  File "/mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/expressions.py", line 66, in visit_BinOp
    right = self.visit(node.right)
  File "/usr/lib/python3.6/ast.py", line 253, in visit
    return visitor(node)
  File "/mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/expressions.py", line 66, in visit_BinOp
    right = self.visit(node.right)
  File "/usr/lib/python3.6/ast.py", line 253, in visit
    return visitor(node)
  File "/mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/expressions.py", line 93, in visit_Name
    raise ValueError(f"{xdarray} does not look "
ValueError: <daskms.dataset.Variable object at 0x7f16440c17f0> does not look like a valid DATA array. Should have (row, chan, corr) dimsInstead has ('row', 'DIR1_DATA-1', 'DIR1_DATA-2')
> /mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/expressions.py(93)visit_Name()
-> raise ValueError(f"{xdarray} does not look "

If my slowness to engage means that the "This branch is out-of-date with the base branch" is relevant to the above, and I need to try a newer version, then please let me know and I'll have a go.

Cheers.

@sjperkins
Copy link
Member

Thanks for the feedback @IanHeywood.

I agree that having an assigment statement (--data-column="FLAG_DATA=X + Y") is overkill on the command line. I've modified dask-ms to accept expressions instead of assigments and tricolour so that --data-column="X + Y" now runs through.

--data-columnn=DATA also runs through for me.

Would you mind having another go?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wip Work in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants