WIP: Multi model columns #76

smasoka · 2020-10-30T09:52:50Z

Draft PR to handle multi-column expressions.

ratt-priv-ci · 2020-10-30T09:54:10Z

Can one of the admins verify this patch?

bennahugo · 2020-10-30T10:53:32Z

Hi can we please not rely on a branch in dependencies. Either a point release is needed or a specific commit checksum is needed

sjperkins · 2020-10-30T11:02:20Z

Hi can we please not rely on a branch in dependencies. Either a point release is needed or a specific commit checksum is needed

Yep, the intention is to try things out in this branch until we're happy and then release dask-ms

sjperkins · 2020-11-02T08:06:25Z

I think the changes are more complicated than they need to be because they implement some of the expression parsing behaviour in data_colum_expr. In particular it will break on residual computation i.e.: DATA - MODEL_DATA.

In general I think the change could be simplified to the following general idea:

parts = args.data_column.split("=")

if parts == 1:
   flag_data = parts[0]
else:
  data_col = parts[0]
  nds = data_column_expr(args.data_column, datasets)
  flag_data = getattr(nds, data_col)

Edit: Corrections

bennahugo · 2020-11-02T08:43:42Z

Instead of reimplementing a parse tree that obeys botmath rules to evaluate this why not use numexpr with a blockwise map on the data? Parse out the column names between the operators using regex as you have now and assign them to variables then map numexpr.evaluate over the arrays. Assumption that the arrays have the same shape can be made

In [1]: import numexpr

In [2]: numexpr.evaluate("5+2")
Out[2]: array(7, dtype=int32)

In [3]: import numpy as np

In [4]: a = np.ones((100,64,4))

In [5]: b = np.ones((100,64,4))

In [6]: numexpr.evaluate("a+b")

sjperkins · 2020-11-02T08:45:07Z

Instead of reimplementing a parse tree that obeys botmath rules to evaluate this why not use numexpr with a blockwise map on the data? Parse out the column names between the operators using regex as you have now and assign them to variables then map numexpr.evaluate over the arrays. Assumption that the arrays have the same shape can be made
In [1]: import numexpr

In [2]: numexpr.evaluate("5+2")
Out[2]: array(7, dtype=int32)

In [3]: import numpy as np

In [4]: a = np.ones((100,64,4))

In [5]: b = np.ones((100,64,4))

In [6]: numexpr.evaluate("a+b")

Well the parse tree is already implemented in plain python -- no need to add an extra dependency which unfortunately only has one maintainer.

sjperkins · 2020-11-02T08:49:22Z

Also, I should add that the parse tree evaluates bodmas expressions involving dask arrays

bennahugo · 2020-11-02T09:06:00Z

ok but as I see it it evaluates a limited subset of expressions, for example it is assumed that there is one column in front of the division operator, so what if I have corrected residuals that I would want to flag a quotient on?
Then I need to specify to add back the sources
(CORRECTED_DATA + DE_MODEL1 ) / MODEL_DATA

Just an example though - in this case the data was calibrated to MODEL_DATA - DE_MODEL1:DE_MODEL1 as the DI and DD directions.

sjperkins · 2020-11-02T09:17:29Z

ok but as I see it it evaluates a limited subset of expressions, for example it is assumed that there is one column in front of the division operator, so what if I have corrected residuals that I would want to flag a quotient on?

Currently yes, the PR needs to change to depend more fully on data_column_expr.

Then I need to specify to add back the sources
(CORRECTED_DATA + DE_MODEL1 ) / MODEL_DATA

data_column_expr should handle this: https://dask-ms.readthedocs.io/en/latest/api.html#daskms.expressions.data_column_expr

Just an example though - in this case the data was calibrated to MODEL_DATA - DE_MODEL1:DE_MODEL1 as the DI and DD directions.

smasoka · 2020-11-02T10:02:02Z

The regex split is mainly for

xds = list(xds_from_ms(args.ms,
                           columns=tuple(columns),
                           ...................

which is faster than reading the entire ms, versus splitting for data_column_expr which handles the expression well.
I think a good re.split can handle this and support cases such as DATA-MODEL and DATA / (DIR1_DATA + DIR2_DATA + MODEL_DATA) at once.

sjperkins · 2020-11-02T10:09:31Z

The regex split is mainly for

xds = list(xds_from_ms(args.ms,
                           columns=tuple(columns),
                           ...................

Ah I see, you need to read the MS to check whether the columns in the expression exist?

which is faster than reading the entire ms, versus splitting for data_column_expr which handles the expression well.
I think a good re.split can handle this and support cases such as DATA-MODEL and DATA / (DIR1_DATA + DIR2_DATA + MODEL_DATA) at once.

Just to be clear, because I think there's confusion here. data_column_expr should be able to handle virtually any simple calculator expression (BODMAS) involving data columns, so there shouldn't be any need for regex in the tricolour argument handling.

The reason for the split is to distinguish between the two --data-column cases:

1.It contains an expression (detectable by the presence of "=" in the string), in which case args.data_column can be used as input to data_column_expr.
2. a standard data column.

sjperkins · 2020-11-02T11:32:11Z

@bennahugo Regarding expressions containing division operations, what should happen in the case of zeroed/inf/nan visibilities?

bennahugo · 2020-11-02T12:50:47Z

There is an explicit flagging step for this

bennahugo · 2020-11-02T12:51:51Z

(Those visibilities are in error)

2. extract visibility column name to support substract_model_column 3. removed columns arg from xds_from_ms to read entire ms 4. add warning for -smc arg depreciation

tricolour/apps/tricolour/app.py

sjperkins · 2021-01-04T16:03:36Z

tricolour/apps/tricolour/app.py

-        string_columns = None
+    # extract the name of the visibility column
+    # to support subtract_model_column
+    data_column = re.split(r'\+|-|/|\*|\(|\)', args.data_column)[0]


This looks like a regex designed to detect the kind of expressions that https://github.com/ska-sa/dask-ms/blob/master/daskms/expressions.py#L103 supports. I think it's probably better to just split on the equals as follows as its easer to separate into the two cases

parts = args.data_column.split("=") log.info("Flagging on the {0:s} {0:s}".format(parts[0], "column" if len(parts) == 1 else "expression))

Additionally, I don't think the regex contained an equals sign so I think it would have failed on an assignment statement.

Could you please correct and test the two cases?

The simple case where data_column just references a column name

The complicated case where data_column contains an assignment.

tricolour/apps/tricolour/app.py

sjperkins · 2021-01-07T15:29:08Z

Is this ready for review? The CI seems to be failing.

smasoka · 2021-01-08T09:44:18Z

@IanHeywood do you mind trying this PR branch and give your initial feedback?

IanHeywood · 2021-01-15T23:18:53Z

I installed the multi-model_columns branch. I can't get it to work, but maybe I just don't understand the syntax required.

tricolour - 2021-01-15 23:15:29,339 INFO - Adding field 'CDFS_2_4' scan 3 to compute graph for processing
Unexpected error. Dropping you into pdb for a post-mortem.
Traceback (most recent call last):
  File "/mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/dataset.py", line 381, in __getattr__
    return self._attrs[name]
KeyError: 'DATA-MODEL_DATA'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/home/ianh/venv/tricolour_multicol/bin/tricolour", line 11, in <module>
    load_entry_point('tricolour', 'console_scripts', 'tricolour')()
  File "/mnt/home/ianh/Software/tricolour_multicol/tricolour/apps/tricolour/app.py", line 261, in main
    _main(args)
  File "/mnt/home/ianh/Software/tricolour_multicol/tricolour/apps/tricolour/app.py", line 373, in _main
    nrow, nchan, ncorr = getattr(ds, data_column).data.shape
  File "/mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/dataset.py", line 383, in __getattr__
    raise AttributeError("Invalid Attribute %s" % name)
AttributeError: Invalid Attribute DATA-MODEL_DATA
> /mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/dataset.py(383)__getattr__()
-> raise AttributeError("Invalid Attribute %s" % name)
(Pdb)

I tried -dc and --data-column, with and without the = sign, and also + instead of -.

smasoka · 2021-01-18T11:30:41Z

-dc arg is based on daskms expression. So something like this -dc "FLAG_DATA = DATA - MODEL_DATA" should work for expressions and simply -dc DATA for a column.

sjperkins · 2021-01-27T06:35:06Z

Mental note, we mustn't merge this before:

A new dask-ms release is put out with the data column expression functionality in it
The new dask-ms master branch is replaced with the new dask-ms version in this PR

IanHeywood · 2021-02-10T17:09:25Z

Trying this out again. Using quotations marks as per the example and docs doesn't work, e.g.

-dc "FLAG_DATA = DATA - MODEL_DATA"

gives:

Unexpected error. Dropping you into pdb for a post-mortem.
Traceback (most recent call last):
  File "/mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/dataset.py", line 381, in __getattr__
    return self._attrs[name]
KeyError: 'FLAG_DATA '

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/home/ianh/venv/tricolour_multicol/bin/tricolour", line 11, in <module>
    load_entry_point('tricolour', 'console_scripts', 'tricolour')()
  File "/mnt/home/ianh/Software/tricolour_multicol/tricolour/apps/tricolour/app.py", line 261, in main
    _main(args)
  File "/mnt/home/ianh/Software/tricolour_multicol/tricolour/apps/tricolour/app.py", line 373, in _main
    nrow, nchan, ncorr = getattr(ds, data_column).data.shape
  File "/mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/dataset.py", line 383, in __getattr__
    raise AttributeError("Invalid Attribute %s" % name)
AttributeError: Invalid Attribute FLAG_DATA
> /mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/dataset.py(383)__getattr__()
-> raise AttributeError("Invalid Attribute %s" % name)

If I provide an expression with no spaces and no quotations marks, e.g.:

-dc FLAG_DATA=DATA-MODEL_DATA

starts flagging. Although I can't vouch for what's going on under the hood.

Does FLAG_DATA end up being a new column in the MS, or is that just an intermediate product that needs to be given a name and is discarded once the flags are written?

I'm more confident when doing this sort of thing if I can use parentheses to be sure of the ordering, as is also shown in the dask documentation example. But bash doesn't like this unless I use quotes, e.g.:

-dc "FLAG=DATA-(MODEL_DATA-DIR1_DATA)"

at which point this happens:

Unexpected error. Dropping you into pdb for a post-mortem.
Traceback (most recent call last):
  File "/mnt/home/ianh/venv/tricolour_multicol/bin/tricolour", line 11, in <module>
    load_entry_point('tricolour', 'console_scripts', 'tricolour')()
  File "/mnt/home/ianh/Software/tricolour_multicol/tricolour/apps/tricolour/app.py", line 261, in main
    _main(args)
  File "/mnt/home/ianh/Software/tricolour_multicol/tricolour/apps/tricolour/app.py", line 297, in _main
    xds = data_column_expr(args.data_column, xds)
  File "/mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/expressions.py", line 140, in data_column_expr
    new_datasets = v.visit(expr)
  File "/usr/lib/python3.6/ast.py", line 253, in visit
    return visitor(node)
  File "/mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/expressions.py", line 33, in visit_Assign
    values = self.visit(node.value)
  File "/usr/lib/python3.6/ast.py", line 253, in visit
    return visitor(node)
  File "/mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/expressions.py", line 66, in visit_BinOp
    right = self.visit(node.right)
  File "/usr/lib/python3.6/ast.py", line 253, in visit
    return visitor(node)
  File "/mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/expressions.py", line 66, in visit_BinOp
    right = self.visit(node.right)
  File "/usr/lib/python3.6/ast.py", line 253, in visit
    return visitor(node)
  File "/mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/expressions.py", line 93, in visit_Name
    raise ValueError(f"{xdarray} does not look "
ValueError: <daskms.dataset.Variable object at 0x7f16440c17f0> does not look like a valid DATA array. Should have (row, chan, corr) dimsInstead has ('row', 'DIR1_DATA-1', 'DIR1_DATA-2')
> /mnt/home/ianh/venv/tricolour_multicol/lib/python3.6/site-packages/daskms/expressions.py(93)visit_Name()
-> raise ValueError(f"{xdarray} does not look "

If my slowness to engage means that the "This branch is out-of-date with the base branch" is relevant to the above, and I need to try a newer version, then please let me know and I'll have a go.

Cheers.

This reverts commit 86eeefb.

sjperkins · 2021-02-11T13:26:58Z

Thanks for the feedback @IanHeywood.

I agree that having an assigment statement (--data-column="FLAG_DATA=X + Y") is overkill on the command line. I've modified dask-ms to accept expressions instead of assigments and tricolour so that --data-column="X + Y" now runs through.

--data-columnn=DATA also runs through for me.

Would you mind having another go?

smasoka added 2 commits October 30, 2020 12:10

support multi-column expression and depency on daskms master

e44846d

remove sothing.py

5050a43

smasoka force-pushed the multi-model_columns branch from e965cfc to 5050a43 Compare October 30, 2020 10:11

bennahugo added the wip Work in progress label Oct 30, 2020

bennahugo changed the title ~~Multi model columns~~ WIP: Multi model columns Oct 30, 2020

bennahugo marked this pull request as draft October 30, 2020 11:47

1. update argument help messages

d18ccbe

2. extract visibility column name to support substract_model_column 3. removed columns arg from xds_from_ms to read entire ms 4. add warning for -smc arg depreciation

smasoka marked this pull request as ready for review November 13, 2020 10:36

sjperkins requested changes Jan 4, 2021

View reviewed changes

supports assignment (=) in expression

155c244

smasoka and others added 2 commits January 7, 2021 18:33

fix logic

918a366

Fix infinite recursion when trying to determine blockwise meta

0bce875

smasoka mentioned this pull request Jan 14, 2021

SPW selection (or relaxation of assumptions) required #77

Open

adds test for expression and strip data_column

f961d61

flake8 fixup

5c38afc

smasoka mentioned this pull request Jan 18, 2021

Upstream breakage in master #78

Closed

smasoka mentioned this pull request Jan 26, 2021

ddid fixes #79

Merged

sjperkins added 7 commits February 11, 2021 13:39

Rework with upstream dask changes

6725ad1

Merge remote-tracking branch 'upstream/master' into multi-model_columns

2813d09

rework test fixtures so that download only occurs once

48f8e73

try the column case before the expression case

01263b3

provoke a test run

86eeefb

Revert "provoke a test run"

de75c38

This reverts commit 86eeefb.

flake8

e3195d6

Update help

9af1b4e

sjperkins mentioned this pull request Mar 2, 2021

add/subtract from existing column caracal-pipeline/crystalball#13

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: Multi model columns #76

WIP: Multi model columns #76

smasoka commented Oct 30, 2020

ratt-priv-ci commented Oct 30, 2020

bennahugo commented Oct 30, 2020

sjperkins commented Oct 30, 2020

sjperkins commented Nov 2, 2020 •

edited

Loading

bennahugo commented Nov 2, 2020

sjperkins commented Nov 2, 2020

sjperkins commented Nov 2, 2020 •

edited

Loading

bennahugo commented Nov 2, 2020 •

edited

Loading

sjperkins commented Nov 2, 2020

smasoka commented Nov 2, 2020

sjperkins commented Nov 2, 2020

sjperkins commented Nov 2, 2020

bennahugo commented Nov 2, 2020

bennahugo commented Nov 2, 2020

sjperkins Jan 4, 2021

sjperkins commented Jan 7, 2021

smasoka commented Jan 8, 2021

IanHeywood commented Jan 15, 2021

smasoka commented Jan 18, 2021

sjperkins commented Jan 27, 2021

IanHeywood commented Feb 10, 2021

sjperkins commented Feb 11, 2021

WIP: Multi model columns #76

Are you sure you want to change the base?

WIP: Multi model columns #76

Conversation

smasoka commented Oct 30, 2020

ratt-priv-ci commented Oct 30, 2020

bennahugo commented Oct 30, 2020

sjperkins commented Oct 30, 2020

sjperkins commented Nov 2, 2020 • edited Loading

bennahugo commented Nov 2, 2020

sjperkins commented Nov 2, 2020

sjperkins commented Nov 2, 2020 • edited Loading

bennahugo commented Nov 2, 2020 • edited Loading

sjperkins commented Nov 2, 2020

smasoka commented Nov 2, 2020

sjperkins commented Nov 2, 2020

sjperkins commented Nov 2, 2020

bennahugo commented Nov 2, 2020

bennahugo commented Nov 2, 2020

sjperkins Jan 4, 2021

Choose a reason for hiding this comment

sjperkins commented Jan 7, 2021

smasoka commented Jan 8, 2021

IanHeywood commented Jan 15, 2021

smasoka commented Jan 18, 2021

sjperkins commented Jan 27, 2021

IanHeywood commented Feb 10, 2021

sjperkins commented Feb 11, 2021

sjperkins commented Nov 2, 2020 •

edited

Loading

sjperkins commented Nov 2, 2020 •

edited

Loading

bennahugo commented Nov 2, 2020 •

edited

Loading