
Trf robustness #1773

Merged: 13 commits merged into main from trf-robustness on Nov 22, 2022

Conversation

mguerrajordao
Collaborator

List of changes:

lib/pymedphys/_trf/decode/config.json

  • added a version_row dict and offset for decoding the trf, based on the version decoded from the trf header.
  • added item_part_names, a dictionary of names for the item_parts list decoded from the data in the trf header.

lib/pymedphys/_trf/decode/header.py

  • some formatting changes are due to the environment applying black automatically.
  • added mu, version, item_parts_number, item_parts_length and the item_parts list, all decoded from the data in the trf header
  • improved regex_trf for finding the length of the dynamically generated part of the header
  • decoded the multiple groups (changed ascii to utf-8, but the result should be similar)
  • mu, version and item_parts_number are also decoded from the last group of the regex match.
  • item_parts is a numpy array whose size is calculated dynamically from item_parts_number, which can be read as above. This allows the data to be decoded dynamically, independent of the version (a rough sketch follows below).
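A rough illustration of the dynamic part of this header decoding; the byte offsets other than version, and the assumed two int16 values per item part, are illustrative assumptions rather than the exact layout in header.py:

import numpy as np

def decode_dynamic_header_part(last_group: bytes) -> dict:
    # version is decoded from bytes 8:12 of the final regex group; the
    # neighbouring offsets used for item_parts_number here are assumed.
    version = np.frombuffer(last_group[8:12], dtype="<i4").item()
    item_parts_number = np.frombuffer(last_group[12:16], dtype="<i4").item()

    # Assume each item part is described by a pair of int16 values (hence
    # 2 * 2 bytes per item part). item_parts is sized from item_parts_number,
    # so the decode adapts to whatever the file declares rather than using a
    # hardcoded column count.
    item_parts = np.frombuffer(
        last_group[16 : 16 + 2 * 2 * item_parts_number], dtype="<i2"
    )
    return {
        "version": version,
        "item_parts_number": item_parts_number,
        "item_parts": item_parts,
        "item_parts_length": int(len(item_parts)),
    }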

lib/pymedphys/_trf/decode/table.py
- this exhibits most of the changes. The line grouping can now be determined exactly from the header's item_parts_number and item_parts list, so most of the code that relied on inspecting the file and trial and error can be removed.

  • decode_trf_table has changed while keeping the same interface. decode_column calculates the line_grouping dynamically from the header (version, item_parts_length and offset). The offset needs to be looked up from the version (as in the dictionary in config.json), because different versions may use different dtypes, so this still has to be hardcoded, but it can be easily estimated from the version (see the sketch after this list).
  • in convert_data_table, convert_applying_negative, divide_by_10 and remaining no longer need to be used.
  • the code is smaller but still needs validation
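A minimal sketch of the dynamic line grouping described above; the version-to-offset mapping and the exact relationship between offset and row length are illustrative assumptions standing in for the entries in config.json:

import numpy as np

# Hypothetical stand-in for the per-version offsets stored in config.json.
VERSION_OFFSET = {1: 2, 3: 4, 4: 4}

def group_table_rows(table_bytes: bytes, version: int, item_parts_length: int) -> np.ndarray:
    # Each row holds item_parts_length int16 values plus a few leading bytes;
    # the number of leading bytes (the offset) depends on the trf version.
    offset = VERSION_OFFSET[version]
    line_grouping = offset + 2 * item_parts_length

    n_rows = len(table_bytes) // line_grouping
    rows = np.frombuffer(table_bytes[: n_rows * line_grouping], dtype=np.uint8)
    return rows.reshape(n_rows, line_grouping)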

mguerrajordao and others added 3 commits November 2, 2022 00:36
…nd trf table creation modified. run tests and try to maintain the interface on trf2pandas
@SimonBiggs SimonBiggs marked this pull request as draft November 18, 2022 04:31
Member

@SimonBiggs SimonBiggs left a comment


Just a few changes to try and fix the "clean" tests

Collaborator

@sjswerdloff sjswerdloff left a comment


LGTM.
When I merged this into my draft PR/branch on my local system and incorporated the missing items in the dict (see the comment in the review), the trf to-csv run went to completion and the data looked about right (short of knowing what the items are supposed to contain).
This is a huge step forward for TRF decoding!

"2537_220": "Y1 Leaf 78/Positional Error (mm)",
"2538_220": "Y1 Leaf 79/Positional Error (mm)",
"2539_220": "Y1 Leaf 80/Positional Error (mm)",
"2170_111": "Mlc Status/Actual Value (None)"
Collaborator


Did you want to put this as the column name after the "Y1 Leaf 80/Positional Error" column name entry (below)?

Collaborator Author


Yes, I believe this is used on the latest version of the MR Linac (v4), but you can double-check against the version you have files from. I believe this 2170_111 is present. I agree with the suggestion below about filling the dictionary with unknowns, and then looking in the software later to complete config.json.

np.concatenate((timestamps[i], item_parts))
for i, item_parts in enumerate(item_part_values_data)
]
column_names = ["Timestamp Data"] + [
Collaborator


This will raise a KeyError when there is a value c in column_names_from_data that is not in column_names_from_dict. That's appropriate behaviour for Python, but...
One could either just add to the dictionary and hope for the best, or

column_names_from_dict_including_unknowns = dict(column_names_from_dict)
for c in column_names_from_data:
    if c not in column_names_from_dict_including_unknowns:
        column_names_from_dict_including_unknowns[c] = "Item: " + c
        print(f'"{c}": "Item: {c}",')

and emit the missing items before the exception is raised.
That would let the user/programmer temporarily add those items to the config.json file for themselves. Which (looking forward) is what I did (this code is great!) so I could decode some MR Linac TRF data.

Collaborator Author


This is a much better way to deal with it and avoid a runtime error. Since we have the list of item parts from the header, it can be completed later.

@SimonBiggs
Member

This is a huge step forward for TRF decoding!

👍 I absolutely agree! Some brilliant work here @mguerrajordao 🙂 🎉

@mguerrajordao
Collaborator Author

@sjswerdloff @SimonBiggs Hi both, happy that this small contribution is welcomed by you. I learned a lot from the previous code and just tried to offer my suggestions. It took a while peeking into the data in the header.
Also, the timestamps for each entry were a stroke of luck from trying to read the first bytes of each row and decoding them as different integer types. When the value looked like an epoch timestamp, and it was around the same time as the date in the header, that gave confidence.
Another difficulty was picking up somebody else's code and trying to add something without breaking it. I am still very much a rookie at collaborative coding, and pymedphys being such a mature library, please excuse any practices that do not conform to the norm.
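A rough sketch of the kind of check described above; the byte offset, endianness and epoch unit here are assumptions for illustration, not a statement of the actual trf row layout:

import numpy as np
import pandas as pd

def first_bytes_as_timestamp(row: bytes) -> pd.Timestamp:
    # Assumption: the first 8 bytes of a row hold a little-endian integer
    # epoch timestamp in milliseconds. If the decoded value lands close to
    # the date recorded in the header, that supports the interpretation.
    raw = np.frombuffer(row[:8], dtype="<i8").item()
    return pd.to_datetime(raw, unit="ms")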

@sjswerdloff
Collaborator

@mguerrajordao , there seem to be some regression test failures (with some of the sample TRF data that is stored on Zenodo for the purpose of regression tests).
Can you look into those?
Running the tests locally can be done with a command line (might be something like):
poetry run pymedphys dev tests --run-only-slow

@mguerrajordao
Collaborator Author

@sjswerdloff, yes I have noticed. I have discussed with @SimonBiggs maintaining the interfaces. I am guessing it could be down to naming in the dictionary; I will double-check. But it could also be due to the previous output having 4 columns as Unknown which are now set as the Timestamps. So I'll look one by one.
Does the test compare the actual decoded data as well?

@sjswerdloff
Collaborator

@mguerrajordao I am not familiar enough with the code and tests to say, but I imagine the values are being compared.
That's a big part of the value of regression testing...
It sounds like some change to the tests is appropriate given that you are decoding things that were not being decoded before.
I'll try to take a look at this later this week, but if you make edits to the test code and put that in the review, I'll make it a high priority to review that promptly (within 24 hours).

@mguerrajordao
Collaborator Author

@sjswerdloff I'll focus on the tests today and push any changes if I manage to.
I'll make sure that the regression tests pass first, or give justification for why any part is failing.

@mguerrajordao
Collaborator Author

Got one issue:

Original config.json has the linear scales as degrees (deg). Somehow in the past I must have corrected them to mm. I will roll back config.json to match the original and follow up on the next regression failure.

"Table Longitudinal/Scaled Actual (deg)",
"Table Longitudinal/Positional Error (deg)",
"Table Lateral/Scaled Actual (deg)",
"Table Lateral/Positional Error (deg)",
"Table Height/Scaled Actual (deg)",
"Table Height/Positional Error (deg)",

@mguerrajordao
Collaborator Author

Got another couple of issues in the value comparison:

  • On "Table Isocentric/Scaled Actual (deg)" there is a factor of 10 mismatch. I have corrected it and will investigate further.
  • On "Dose/Raw value (1/64th Mu)" we get negative values with the new method, which is obviously wrong. I hadn't noticed this before, but I believe it is due to the signed/unsigned int nature: the datatype is read as signed int16 for all the values in a row (which makes sense for most of the values, which are bipolar). However, I think we are reaching the counter limit on this particular item and an additional conversion needs to be made (the value is being processed as +/- 32768, i.e. 15 bits plus 1 sign bit). Will check further on how to convert it (see the sketch below).

@SimonBiggs
Member

It sounds like some change to the tests is appropriate given that you are decoding things that were not being decoded before.

One of the ideas was to do this in two or three PRs: have the first PR (this one) leave all the tests unchanged and all the results the same, then have a follow-up PR be allowed to adjust the baselines as well as update a range of items. In some scenarios being stuck in this approach would be quite painful, so we weren't planning on making it a requirement to the point of it being painful, only if it is reasonably achievable.

@SimonBiggs
Member

SimonBiggs commented Nov 21, 2022

Original config.json has the linear scales as degree (deg).

Yup, this was done to match the original Elekta TRF decoding tool, and done under the impression that that tool was going to stick around. So it had the job of matching the previous tool, even when that tool had errors.

But now, that tool is no longer available, and having these column labels be wrong for compatibility is no longer the right choice I believe.

Still, let's have this PR remain consistent with the current baselines, and then a follow-up PR can make those corrections to the code and the baseline datasets.

@mguerrajordao
Collaborator Author

@SimonBiggs
I am trying to convert whatever is necessary back so the regression tests pass. Working on it. I'll have to modify/add the last conversion step with some individual conversions in order to match the original dataset. Then we can move forward from there.

@sjswerdloff
Collaborator

But now, that tool is no longer available, and having these column labels be wrong for compatibility is no longer the right choice I believe.

I was under the impression the tool was still available, but I have no idea what a clinical site has to do to get it.
Which suggests that one might want to provide some mechanism for retaining that compatibility.
On the other hand, having someone hand edit their CSV or have an outboard conversion for the labelling isn't that big a deal.

@mguerrajordao
Collaborator Author

Corrected the difference on "Dose/Raw value (1/64th Mu)" by applying an offset... Now the test passes up to trf version 1 files (where 350 columns are expected):

dataframe["Dose/Raw value (1/64th Mu)"] = dataframe[
"Dose/Raw value (1/64th Mu)"
].apply(lambda x: x + 2**16 if x < 0 else x)

However it is now failing on version 3 (Int 4). The converted dataframe has 351 columns (+1 extra column due to the timestamp), whereas the reference dataframe has 4 extra columns (unknown 1 to 4). I will shim this part by splitting the int64 timestamp back into 4 * int16 values so we can continue (see the sketch below)...
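A minimal sketch of that shim, assuming the "Timestamp Data" column holds int64 values; the unknown1..unknown4 column names are illustrative stand-ins for whatever the reference baseline uses:

import numpy as np
import pandas as pd

def split_timestamp_column(dataframe: pd.DataFrame) -> pd.DataFrame:
    # Reinterpret each int64 timestamp as four consecutive int16 values so
    # the column layout matches the previous baseline's four unknown columns.
    raw = dataframe["Timestamp Data"].to_numpy(dtype=np.int64)
    parts = raw.view(np.int16).reshape(-1, 4)
    dataframe = dataframe.drop(columns=["Timestamp Data"])
    for i in range(4):
        dataframe[f"unknown{i + 1}"] = parts[:, i]
    return dataframe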

@mguerrajordao
Collaborator Author

@SimonBiggs @sjswerdloff
Some progress...

After shimming by splitting the int64 into 4 * int16 unknowns I was able to pass the test 😄.
I will push the code to the branch. I've omitted some warnings about the pylinac version and the virtualenv I have.
Somehow poetry is clashing with another flask application I have (managed with pipenv for its virtual environment), but I think it can be safely ignored.

Running pytest with cwd set to: /mnt/d/programming/pymedphys/lib/pymedphys

Test session starts (platform: linux, Python 3.10.8, pytest 7.1.3, pytest-sugar 0.9.5)
rootdir: /mnt/d/programming/pymedphys/lib/pymedphys
plugins: rerunfailures-10.2, hypothesis-5.49.0, sugar-0.9.5, anyio-3.6.1, Faker-15.3.2
collecting ...
 tests/trf/test_decode.py ✓                                100% ██████████
============================== warnings summary ==============================

Results (36.33s):
       1 passed
     185 deselected
poetry run pymedphys dev tests -k test_decode --slow  34.12s user 3.89s system 94% cpu 40.038 total
Running pytest with cwd set to: /mnt/d/programming/pymedphys/lib/pymedphys

Test session starts (platform: linux, Python 3.10.8, pytest 7.1.3, pytest-sugar 0.9.5)
rootdir: /mnt/d/programming/pymedphys/lib/pymedphys
plugins: rerunfailures-10.2, hypothesis-5.49.0, sugar-0.9.5, anyio-3.6.1, Faker-15.3.2
collecting ...
 tests/trf/test_date_convert.py ✓                          100% ██████████
============================== warnings summary ==============================

Results (9.13s):
       1 passed
     185 deselected
poetry run pymedphys dev tests -k test_date_convert  7.55s user 3.59s system 88% cpu 12.645 total

@SimonBiggs
Member

Beautiful stuff @mguerrajordao :)

…ataframe columns. table.py: amended the factor of 10 on Table Isocentric, corrected from bipolar signed int16 on Dose/Raw value (1/64th of Mu) and split and dropped the int64 Timestamp Data column into 4 int16 named unknow1-4 columns
…o trf-robustness

merging with differences for pytest pass.
Member

@SimonBiggs SimonBiggs left a comment


This is absolutely beautiful stuff @mguerrajordao. It's brilliant to see TRF be taken into the next era of Integrity :).

Thank you so much @mguerrajordao. I have made a few stylistic-type comments. This is almost ready for a merge :)

@SimonBiggs
Member

SimonBiggs commented Nov 21, 2022

Also, heads up @mguerrajordao, running the following in the CI suite on GitHub actions fails for this PR:

poetry run pymedphys dev lint

So, it would be worth running that locally and making sure it passes.

@SimonBiggs
Member

Hi Marcelo,

Tried to send you a private email thanking you and asking for feedback, but it was blocked with the following message:

The response from the remote server was:
550 Administrative prohibition - envelope blocked - https://community.mimecast.com/docs/DOC-1369#550 [t1PY0WDfN5qt9Mhz9I8JzA.uk251]

@mguerrajordao
Collaborator Author

Hi Simon,
I don't know why. I checked Mimecast for flagged messages, and I can see your attempted email. It says the envelope was rejected, with some details. We can use my private email.

@mguerrajordao
Collaborator Author

Regarding the lint check (poetry run pymedphys dev lint):
It complains about some of the code, but then fails after scoring 9.98/10. I will investigate.

Linting with cwd set to:
    /mnt/d/programming/pymedphys

************* Module lib.pymedphys._streamlit.apps.metersetmap._trf
lib/pymedphys/_streamlit/apps/metersetmap/_trf.py:299:22: E1120: No value for argument 'header_table_contents' in function call (no-value-for-parameter)
************* Module lib.pymedphys._trf.decode.detect
lib/pymedphys/_trf/decode/detect.py:34:16: E1123: Unexpected keyword argument 'input_line_grouping' in function call (unexpected-keyword-arg)
lib/pymedphys/_trf/decode/detect.py:34:16: E1123: Unexpected keyword argument 'input_linac_state_codes_column' in function call (unexpected-keyword-arg)
lib/pymedphys/_trf/decode/detect.py:34:16: E1123: Unexpected keyword argument 'reference_state_code_keys' in function call (unexpected-keyword-arg)
lib/pymedphys/_trf/decode/detect.py:34:16: E1120: No value for argument 'version' in function call (no-value-for-parameter)
lib/pymedphys/_trf/decode/detect.py:34:16: E1120: No value for argument 'item_parts_length' in function call (no-value-for-parameter)
lib/pymedphys/_trf/decode/detect.py:34:16: E1120: No value for argument 'item_parts' in function call (no-value-for-parameter)

-----------------------------------
Your code has been rated at 9.98/10

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/mnt/d/programming/pymedphys/lib/pymedphys/cli/__init__.py", line 142, in pymedphys_cli
    args.func(args, remaining)
  File "/mnt/d/programming/pymedphys/lib/pymedphys/_dev/tests.py", line 200, in run_pylint
    subprocess.check_call(command)
  File "/usr/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/home/mguerrajordao/.cache/pypoetry/virtualenvs/pymedphys-l94ZSRXe-py3.10/bin/python', '-m', 'pylint', 'pymedphys', '--rcfile=/mnt/d/programming/pymedphys/lib/pymedphys/.pylintrc']' returned non-zero exit status 2.
poetry run pymedphys dev lint  81.54s user 14.68s system 48% cpu 3:16.68 total

…ded for the processing on table.py. run pytest will all tests marked as slow. all pass
Member

@SimonBiggs SimonBiggs left a comment


Hi @mguerrajordao,

Beautiful stuff :). I have a few remaining "nit-pick" comments, which you are free to ignore if you choose. This has my approval :).

Before merging, the changelog file at the top of the repo needs to be updated.

@@ -12,197 +12,83 @@
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import List
# from typing import List
Member


Suggested change (remove this leftover commented-out line):
# from typing import List

decoded_rows, column_adjustment_key = decode_rows(trf_table_contents)
version = header_table_contents["version"].values[0].astype(int)
item_parts_length = header_table_contents["item_parts_length"].values[0].astype(int)
item_parts = header_table_contents["item_parts"].values[0]
Member


Would it be worth doing this header type conversion right after the header has been parsed? As in, type fixing of the header contents is likely the job of the "decode header" function, not the "decode table" function.

Collaborator Author


@SimonBiggs I'm not sure I understand. Do you mean passing the variables (version, item_parts_length, item_parts) into the function decode_trf_table, instead of passing header_table_contents?

Member


I guess where I am confused is with the following piece:

.astype(int)

My thought is that when the header_table_contents is created, what if it was made int right away? This would mean that it wouldn't be the job of downstream functions to convert it to int.

I do actually believe you're doing this already, at least for item_parts_length:

item_parts_length = int(len(item_parts))

Member


And potentially .astype(int) isn't needed for version either?

version = np.frombuffer(groups[4][8:12], dtype=np.int32).item()

Member


I won't block merge on this though. Really just a "nit-pick". I'll merge after the tests pass, and you can opt to make this change in the next PR if you want to.

@@ -57,7 +57,8 @@ def trf2pandas(trf: path_or_binary_file) -> Tuple["pd.DataFrame", "pd.DataFrame"

trf_header_contents, trf_table_contents = split_into_header_table(trf_contents)
header_dataframe = header_as_dataframe(trf_header_contents)
table_dataframe = decode_trf_table(trf_table_contents)
# table_dataframe = decode_trf_table(trf_table_contents)
Member


Suggested change (remove this commented-out line):
# table_dataframe = decode_trf_table(trf_table_contents)

mguerrajordao and others added 2 commits November 22, 2022 09:50
Co-authored-by: Simon Biggs <simon.biggs@radiotherapy.ai>
Co-authored-by: Simon Biggs <simon.biggs@radiotherapy.ai>
@mguerrajordao
Collaborator Author

@SimonBiggs Please see comments. Thanks for merging.
Just not too sure on:

Would it be worth doing this header type conversion back right after the header has been parsed. As in type fixing of the header contents is likely the job of the "decode header" function, not the "decode table" function.

@SimonBiggs SimonBiggs merged commit 8b9215d into main Nov 22, 2022
@SimonBiggs
Member

SimonBiggs commented Nov 22, 2022

Amazing stuff @mguerrajordao! 🎉 Congrats on becoming a PyMedPhys contributor 🙂. Absolutely wonderful to have your contribution 🙂

Next thing to add to your to-do list is to add yourself to the contributors list 🙂. Put yourself below Derek and above Jake:

https://github.com/pymedphys/pymedphys/blame/main/README.rst#L154

@SimonBiggs SimonBiggs deleted the trf-robustness branch November 22, 2022 02:38