-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Import additional columns from CR6 #290
Conversation
Thanks for putting this together! LMK if you need any input! |
The added columns caused some current tests to fail. Should we only include fwr/cdr1,2 columns for CR6 outputs or include them regardless and use nan as values for non-CR6 outputs? Thank you! |
scirpy/io/_io.py
Outdated
fwrs = [f"fwr{i}" if i < 5 else f"fwr{i-4}_nt" for i in range(1, 9)] | ||
cdrs = [f"cdr{i}" if i < 3 else f"cdr{i-2}_nt" for i in range(1, 5)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code has very unclear semantics... Please either use separate loops for nt and aa or take a look at
itertools.product
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update, thanks for the suggestion!
Please only include those columns if they are present in the CR output. |
Sounds good, thanks! Please let me know if there are additional changes that need to be made. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noticed a small thing about the field names!
And thanks a lot for adding tests :)
scirpy/io/_io.py
Outdated
chain[col] = cell[col].get("aa_seq") if cell[col] else None | ||
chain[col + "_nt"] = cell[col].get("nt_seq") if cell[col] else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the rearrangement standard
the fields are cdr1
for nt and cdr1_aa
for aa
One more thing I was thinking about: Do you think it might make sense to automatically add also a chain['cdr3_aa'] = chain['junction_aa'][1:-1]
chain['cdr3'] = chain['junction'][3:-3] ? ping @zktuong, (How) did you deal with that in dandelion? |
Hmm i don't parse that - i just use the junction column. Your suggestion sounds sensible and clean though. |
yup should be no problem |
Sure, I fixed the column names and added trimmed cdr3 as suggested. But the added cdr3 and cdr3aa seemed to cause the following test to fail:
|
that looks alright, I think you can just update the test cases. |
shall we add cdr3 and cdr3_aa by default or only if cdr1 and cdr2 exist? If the former, |
I would say always. But, maybe then we should implement an The behaviour would be that when converting the chain dicts to the anndata object, per default, only the columns defined here would get imported. If the user wants all attributes (including cdr1, 2, 3), they can set The idea is to avoid cluttering anndata with too many columns that are not needed for the most common use-cases. That might be too much to ask from you though - I can always take over this PR if you prefer. |
Yes, that's a great idea. I added an Additionally, should we verify that |
Not sure... In the example file I looked at, indeed all sequences start with
Maybe @zktuong can help us again? Would it make sense to trim these sequences to [1:-1] as well? Or just set CDR3 to None? |
Are those sequences flagged as out of frame, not full V -> J, contains stop codon or not productive? |
they are flagged as non-productive. So usually they get filtered out anyway. |
yea then skipping trimming and switching to |
ok, perfect. |
HI @grst, I am having trouble with this error, could you please help take a look?
It might result from
as |
|
Sounds good, it would be great if you could take a look at it. Thanks! |
@naity, just to be sure: Did you push all your changes? I'll hopefully find some time later today to debug this. |
Yes, I have. |
1. When include_fields=None, the multi_chain value has not been transferred properly to adata.obs, as it had been overwritten. Rearranged the order of statements to fix it. This prevented bug 2 to surface. 2. Empty Airr cells got initialized with multi_chain=False. This led to a conflict due to 'inconsistent cell-level attributes' as soon as chains got added, even if all added chains correctly had multi_chain=True. Now set it to None initially, as None gets ignored when merging cell-level attributes.
Will take a final look tomorrow and then merge. |
Closes #279.