[BUG] scirpy conversion - rename productive #153

Closed
zktuong opened this issue Jun 9, 2022 · 4 comments · Fixed by #152
Labels
bug Something isn't working

Comments

@zktuong
Owner

zktuong commented Jun 9, 2022

related to scverse/scirpy#343

@zktuong zktuong added the bug Something isn't working label Jun 9, 2022
@zktuong zktuong linked a pull request Jun 9, 2022 that will close this issue
@zktuong
Owner Author

zktuong commented Jun 10, 2022

I have one more question about updating germline sequences with update_germline. I have many samples to update. Should the fasta file be "tigger_heavy_igblast_db-pass_genotype.fasta" (I also got the error in this case), or should I specify it manually for each sample?

OSError: Environmental variable GERMLINE must be set. Otherwise, please provide path to folder containing germline IGHV, IGHD, and IGHJ fasta files.

Hi @sbenjamaporn,

just a few things to ask -

  1. did you run the preprocessing with the singularity container?

  2. if you already have that tigger file, chances are you already have a germline_alignment_d_mask column in your data. In that case you can skip both update_germline and create_germlines and go straight to quantify_mutations, unless tigger failed?

  3. having said that, if you ran the preprocessing through the singularity container, then you can also do vdj.update_plus() and this will retrieve the mutation count and frequency columns into the metadata.

Some clarification:

  1. update_germline only populates the germline slot in the Dandelion object, so that the sequences are easy to retrieve when running create_germlines. You will still need to run create_germlines.

  2. if you are going to manually specify tigger_heavy_igblast_db-pass_genotype.fasta, then each Dandelion object should only hold the sequences that tigger was run on. If samples A, B and C belong to individuals 1, 2 and 1 respectively, there should be two Dandelion objects: one holding A + C and one holding B.
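
The grouping rule in point 2 can be sketched in plain Python. This is only an illustration: the sample names and the `sample_to_individual` mapping are hypothetical, and how you then build one Dandelion object per group depends on how your data is loaded.

```python
from collections import defaultdict


def group_samples_by_individual(sample_to_individual):
    """Return {individual: [samples]} so each group shares one tigger genotype."""
    groups = defaultdict(list)
    for sample, individual in sample_to_individual.items():
        groups[individual].append(sample)
    return {individual: sorted(samples) for individual, samples in groups.items()}


# Hypothetical mapping matching the A/B/C example above
sample_to_individual = {"A": 1, "B": 2, "C": 1}
groups = group_samples_by_individual(sample_to_individual)
print(groups)  # {1: ['A', 'C'], 2: ['B']}
```

Each value in `groups` is the set of samples that would go into one Dandelion object, paired with the genotype fasta that tigger produced for that individual.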

If you follow the documentation, there's an instruction like:

vdj.update_germline(corrected = 'path/to/tigger_heavy_igblast_db-pass_genotype.fasta', germline = None, org = 'human')

where germline is set to None because the path is stored in an environment variable.

I just noticed another bug in the if-else statement that prevents manual input of the germline option, which I'm looking into fixing. So the current workarounds are either:

import os
# download and unpack the database file from https://github.com/zktuong/databases_for_vdj
os.environ['GERMLINE'] = '/path/to/database/germlines/'
vdj.update_germline(corrected = 'path/to/tigger_heavy_igblast_db-pass_genotype.fasta', germline = None, org = 'human')

or directly update vdj.germline with a dictionary like:

from changeo.IO import readGermlines
gml = [
    'path/to/database/germlines/imgt/human/vdj/imgt_human_IGHV.fasta',
    'path/to/database/germlines/imgt/human/vdj/imgt_human_IGHD.fasta',
    'path/to/database/germlines/imgt/human/vdj/imgt_human_IGHJ.fasta',
    'path/to/tigger_heavy_igblast_db-pass_genotype.fasta',  # place this last
]
vdj.germline.update(readGermlines(gml))

This can then be followed up with ddl.pp.create_germlines and ddl.pp.quantify_mutations.
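
Before trying either workaround, it may help to confirm that the germline folder actually contains the expected files. The helper below is a hypothetical sketch, not part of dandelion: `missing_germline_fastas` is an invented name, and the folder layout is an assumption inferred from the paths quoted above.

```python
import os
from pathlib import Path


def missing_germline_fastas(germline_root, org="human"):
    """List expected IGHV/IGHD/IGHJ fasta files that are absent under germline_root.

    Assumes the layout <root>/imgt/<org>/vdj/imgt_<org>_IGH{V,D,J}.fasta,
    which is inferred from the paths quoted in this thread.
    """
    vdj_dir = Path(germline_root) / "imgt" / org / "vdj"
    expected = [vdj_dir / f"imgt_{org}_IGH{seg}.fasta" for seg in "VDJ"]
    return [str(path) for path in expected if not path.is_file()]


germline_root = os.environ.get("GERMLINE")
if germline_root is None:
    print("GERMLINE is not set")
else:
    print("missing files:", missing_germline_fastas(germline_root))
```

An empty list means all three files were found under the assumed layout; anything else points at the file to fix before calling update_germline.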

Let me know if there are any issues.

@sbenjamaporn

Dear @zktuong,

Thanks for your help and recommendations. I ran the preprocessing via the singularity container and then used scirpy to define my clonotypes. The output from scirpy did not include the "germline_alignment_d_mask" column, so I think that even if I convert scirpy's AnnData to dandelion, dandelion cannot apply "quantify_mutations" to it properly.
My error: RRuntimeError: Error in (function (db, sequenceColumn = "sequence_alignment", germlineColumn = "germline_alignment_d_mask", :
The column germline_alignment_d_mask was not found

I then used the update_germline workaround that you suggested (the one with import os), followed by create_germlines. The result also shows KeyError: "['germline_alignment_d_mask'] not in index".

To sum up, my main problem is that germline_alignment_d_mask is not found in my data.

My next approach is to merge the AIRR table from dandelion (the result from before I converted to scirpy, which contains germline_alignment_d_mask) with the AIRR table from scirpy (which has no germline_alignment_d_mask information) to restore the germline of each sequence.

Thanks again! And if you have any more suggestions, feel free to let me know
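
A merge along those lines could look roughly like the sketch below. It rests on stated assumptions: both AIRR tables are represented as lists of row dictionaries (for example from `csv.DictReader` with `delimiter="\t"`), rows are matched on the AIRR `sequence_id` field, and `copy_column_by_sequence_id` is a hypothetical helper rather than a dandelion or scirpy function.

```python
def copy_column_by_sequence_id(source_rows, target_rows,
                               column="germline_alignment_d_mask"):
    """Return copies of target_rows with `column` filled in from source_rows.

    Rows are matched on 'sequence_id'; targets without a match are left as-is.
    """
    lookup = {row["sequence_id"]: row[column]
              for row in source_rows if column in row}
    merged = []
    for row in target_rows:
        new_row = dict(row)  # copy so the input rows are not modified
        if row["sequence_id"] in lookup:
            new_row[column] = lookup[row["sequence_id"]]
        merged.append(new_row)
    return merged


# Toy rows standing in for the dandelion and scirpy AIRR tables
dandelion_rows = [
    {"sequence_id": "cell1_contig_1", "germline_alignment_d_mask": "CAGGTGNNNN"},
]
scirpy_rows = [
    {"sequence_id": "cell1_contig_1", "productive": "T"},
    {"sequence_id": "cell2_contig_1", "productive": "T"},
]
merged_rows = copy_column_by_sequence_id(dandelion_rows, scirpy_rows)
```

Rows with no counterpart in the source table keep their original fields, so it is worth checking how many rows still lack the column after merging.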

@zktuong
Owner Author

zktuong commented Jun 14, 2022

ah ok! i see.

scirpy's conversion only transfers some default AIRR fields. to transfer everything found in a dandelion object, you should do this:

# there's a bug in dandelion's ddl.to_scirpy that doesn't accept the additional kwargs; it will be fixed in the next version
adata = ir.io.from_dandelion(vdj, include_fields = vdj.data.columns)

if you then transfer back, you will find the columns are present:

vdj2 = ddl.from_scirpy(adata)
'germline_alignment_d_mask' in vdj2.data
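
To check every column rather than just one, a small comparison helper can list what was lost in the round trip. This is a hypothetical sketch (`dropped_columns` is not a dandelion function); it works on any iterable of column names, so with real objects you would pass `vdj.data.columns` and `vdj2.data.columns`.

```python
def dropped_columns(before, after):
    """Return column names present in `before` but missing from `after`, in order."""
    after_set = set(after)
    return [col for col in before if col not in after_set]


# Toy column lists standing in for vdj.data.columns and vdj2.data.columns
before = ["sequence_id", "productive", "germline_alignment_d_mask"]
after = ["sequence_id", "productive"]
print(dropped_columns(before, after))  # ['germline_alignment_d_mask']
```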

@sbenjamaporn

Dear @zktuong,

It works! Thank you so much for developing this tool and kindly responding to me.

Best regards,
Benjamaporn
