Alignment mismatch | differences count not correct #3

steletvinicius · 2024-04-18T20:19:15Z

First of all, thanks for developing such a powerfull tool.

I have clinical background on clinical HLA typing and I have checked the output from the example you provided
where one can generate an onehot encode matrix for two alleles.

library(hlabud) a <- hla_alignments("DRB1") dosage(c("DRB1*04:174", "DRB1*15:152"), a$onehot)

I found curious two extremelly different alleles - different serologycal groups - show only one amino acid difference.

When I compared your output against IMGT protein sequence alignment tool, the output is different:

As you can see, there are many amino acid differences among these two proteins.
It seems like the line of code you provided is showing only the first difference among the two compared alleles.

I am new to R - used to program in python - and planning to learn by exploring HLA dedicated tools - like yours.

Is this difference something regarding your library or the R line of code ( dosage(c("DRB104:174", "DRB115:152"), a$onehot) )
that is showing only the first difference and not all of them?

The text was updated successfully, but these errors were encountered:

slowkow · 2024-04-19T00:38:45Z

Thank you for opening an issue! I appreciate the comment, and I plan to keep developing this package to make it more comfortable for users to get started with HLA analysis. So, I'd appreciate any criticisms or suggestions for improvements.

You're right, there are a lot of differences between the two alleles, but I set the default option drop_duplicates=TRUE so you don't see them.

In retrospect this was probably a bad idea — I even confused myself earlier today when I was trying to compare alleles. I think the default should be FALSE, to avoid confusing users. I set it to TRUE because the original motivation for this work was to set myself up for association testing on each amino acid position, and I didn't want to double-test positions that are identical to each other.

> library(hlabud)
> a <- hla_alignments("DRB1")
> dosage(c("DRB1*04:174", "DRB1*15:152"), a$onehot)
            pos9_E pos9_W
DRB1*04:174      1      0
DRB1*15:152      0      1

> dosage(c("DRB1*04:174", "DRB1*15:152"), a$onehot, drop_duplicates=FALSE)
            pos9_E pos9_W pos11_P pos11_V pos13_H pos13_R pos34_H pos34_N pos39_S pos39_Y pos49_F
DRB1*04:174      1      0       0       1       1       0       1       0       0       1       0
DRB1*15:152      0      1       1       0       0       1       0       1       1       0       1
            pos49_Y pos53_A pos53_T pos69_I pos69_L pos73_A pos73_K pos82_R pos82_S pos98_V pos99_Q
DRB1*04:174       1       1       0       0       1       0       1       1       0       0       0
DRB1*15:152       0       0       1       1       0       1       0       0       1       1       1
            pos100_P pos101_K pos102_V pos103_T pos104_V pos105_Y pos106_P pos107_S pos108_K
DRB1*04:174        0        0        0        0        0        0        0        0        0
DRB1*15:152        1        1        1        1        1        1        1        1        1
            pos109_T pos110_Q pos111_P pos112_L pos113_Q pos114_H pos115_H pos116_N pos117_L
DRB1*04:174        0        0        0        0        0        0        0        0        0
DRB1*15:152        1        1        1        1        1        1        1        1        1
            pos118_L pos119_V pos120_C pos121_S pos122_V pos123_S pos124_G pos125_F pos126_Y
DRB1*04:174        0        0        0        0        0        0        0        0        0
DRB1*15:152        1        1        1        1        1        1        1        1        1
            pos127_P pos128_G pos129_S pos130_I pos131_E pos132_V pos133_R pos134_W pos135_F
DRB1*04:174        0        0        0        0        0        0        0        0        0
DRB1*15:152        1        1        1        1        1        1        1        1        1
            pos136_L pos137_N pos138_G pos139_Q pos140_E pos141_E pos142_K pos143_A pos144_G
DRB1*04:174        0        0        0        0        0        0        0        0        0
DRB1*15:152        1        1        1        1        1        1        1        1        1
            pos145_M pos146_V pos147_S pos148_T pos149_G pos150_L pos151_I pos152_Q pos153_N
DRB1*04:174        0        0        0        0        0        0        0        0        0
DRB1*15:152        1        1        1        1        1        1        1        1        1
            pos154_G pos155_D pos156_W pos157_T pos158_F pos159_Q pos160_T pos161_L pos162_V
DRB1*04:174        0        0        0        0        0        0        0        0        0
DRB1*15:152        1        1        1        1        1        1        1        1        1
            pos163_M pos164_L pos165_E pos166_T pos167_V pos168_P pos169_R pos170_S pos171_G
DRB1*04:174        0        0        0        0        0        0        0        0        0
DRB1*15:152        1        1        1        1        1        1        1        1        1
            pos172_E pos173_V pos174_Y pos175_T pos176_C pos177_Q pos178_V pos179_E pos180_H
DRB1*04:174        0        0        0        0        0        0        0        0        0
DRB1*15:152        1        1        1        1        1        1        1        1        1
            pos181_P pos182_S pos183_V pos184_T pos185_S pos186_P pos187_L pos188_T pos189_V
DRB1*04:174        0        0        0        0        0        0        0        0        0
DRB1*15:152        1        1        1        1        1        1        1        1        1
            pos190_E pos191_W
DRB1*04:174        0        0
DRB1*15:152        1        1

> a$sequences %>% filter(str_detect(allele, "DRB1\\*04:174|15:152")) %>% as.data.frame
       allele
1 DRB1*04:174
2 DRB1*15:152

                                      seq
1 **********************************---E-V-H------------.F-D-YF-H---.Y-------------A-------------------K-----------------------.******************************************************************************************************
*****************************************
2 **********************************-----P-R------------.F-D-YF-----.----------F-------------------I---A--------S--------------.-Q------------------------------------L--------M----------------------------------------------********
*****************************************

slowkow · 2024-04-19T01:33:05Z

Thanks for sharing the output from the IMGT alignment tool.

I noticed a discrepancy about how insertions or deletions represented by .. I had assumed that these would be numbered, just like the other amino acid positions are numbered. But IMGT does not number them.

If I observed an association for W in DRB1*14:210Q, I would be tempted to write "we see association with W26", but it seems that this is incorrect. Instead, perhaps we should write "we see association with insertion of W between 25 and 26".

Since hlabud should be consistent with IMGT numbering, this is a bug. Thanks again for pointing this out! I will try to fix this as soon as I have a chance.

 AA Pos.                   -21        -11         -1         10         20          30         40         50         60         70 
 DRB1*04:174         ********* ********** ********** *****RFLEQ VKHECHFFNG TERVR.FLDRY FYHQEEYVRF DSDVGEYRAV AELGRPDAEY WNSQKDLLEQ 
 DRB1*14:210Q        ********* ********** ********** *****----Y STG--Y---- -----W----- -HN---F--- ---------- T-----A--H ---------R 
 DRB1*15:152         ********* ********** ********** *****---W- P-R------- -----.----- --N---S--- ------F--- T--------- ------I--- 
 
 AA Pos.                    80         90        100        110        120        130        140        150        160        170 
 DRB1*04:174        KRAAVDTYCR HNYGVGESFT VQRR****** ********** ********** ********** ********** ********** ********** ********** 
 DRB1*14:210Q       R--E------ -----V---- ----VHPKVT VYPSKTQPLQ HHNLLVCSVS GFYPGSIEVR WFRNGQEEKT GVVSTGLIHN GDWTFQTLVM LETVPRSGEV 
 DRB1*15:152        A--------S ---------- ----VQPKVT VYPSKTQPLQ HHNLLVCSVS GFYPGSIEVR WFLNGQEEKA GMVSTGLIQN GDWTFQTLVM LETVPRSGEV 
 
 AA Pos.                   180        190        200        210        220        230 
 DRB1*04:174        ********** ********** ********** ********** ********** ********** *******    
 DRB1*14:210Q       YTCQVEHPSV TSPLTVEW** ********** ********** ********** ********** *******    
 DRB1*15:152        YTCQVEHPSV TSPLTVEW** ********** ********** ********** ********** *******

 AA Pos.                   -21        -11         -1         10         20         30         40         50         60         70 
 DRB1*04:174         ********* ********** ********** *****RFLEQ VKHECHFFNG TERVRFLDRY FYHQEEYVRF DSDVGEYRAV AELGRPDAEY WNSQKDLLEQ 
 DRB1*15:152         ********* ********** ********** *****---W- P-R------- ---------- --N---S--- ------F--- T--------- ------I--- 
 
 AA Pos.                    80         90        100        110        120        130        140        150        160        170 
 DRB1*04:174        KRAAVDTYCR HNYGVGESFT VQRR****** ********** ********** ********** ********** ********** ********** ********** 
 DRB1*15:152        A--------S ---------- ----VQPKVT VYPSKTQPLQ HHNLLVCSVS GFYPGSIEVR WFLNGQEEKA GMVSTGLIQN GDWTFQTLVM LETVPRSGEV 
 
 AA Pos.                   180        190        200        210        220        230 
 DRB1*04:174        ********** ********** ********** ********** ********** ********** *******    
 DRB1*15:152        YTCQVEHPSV TSPLTVEW** ********** ********** ********** ********** *******

steletvinicius · 2024-04-19T18:50:19Z

To be fair, I did not point to the numbering discrepance anytime and have not realized that before you mention it.
My point with the alignment was to highlight the absence of the others discrepancies when executing that line of code.

It is interesting how different people can figure out different things from the same information.

But you are right: as much as you can stick to IMGT numbering as possible users will benefit from that.

I looked into the IMGT help page for the alingment tool (https://www.ebi.ac.uk/ipd/imgt/hla/alignment/help/) for some clue
on how they indels are officially numbered but at the end I have no suggestion to help you on that.

There is a HLA sequencing data analysis software that identifies a DNA base insertion using the IMGT coordinate followed by : and a number starting by 1. Addapting that for amino acid residues, the inserted W would be numbered 25:1.

related to #3

slowkow · 2024-04-25T22:23:27Z

@steletvinicius Hi Vinicius, I think the latest version 2.0.0 should have correct numbering.

I'd appreciate if you could double-check and report any other issues you see.

slowkow · 2024-04-26T19:38:25Z

Please feel free to re-open this issue or comment on it if you notice anything, but I'm closing it for now because I believe the issue is resolved.

slowkow added a commit that referenced this issue Apr 25, 2024

set default drop_duplicates to FALSE

f6d6a49

related to #3

slowkow closed this as completed Apr 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alignment mismatch | differences count not correct #3

Alignment mismatch | differences count not correct #3

steletvinicius commented Apr 18, 2024

slowkow commented Apr 19, 2024

slowkow commented Apr 19, 2024

steletvinicius commented Apr 19, 2024

slowkow commented Apr 25, 2024

slowkow commented Apr 26, 2024

Alignment mismatch | differences count not correct #3

Alignment mismatch | differences count not correct #3

Comments

steletvinicius commented Apr 18, 2024

slowkow commented Apr 19, 2024

slowkow commented Apr 19, 2024

steletvinicius commented Apr 19, 2024

slowkow commented Apr 25, 2024

slowkow commented Apr 26, 2024