-
Notifications
You must be signed in to change notification settings - Fork 0
/
numbering.Rmd
237 lines (174 loc) · 8.43 KB
/
numbering.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
---
title: "Numbering amino acid positions"
vignette: >
%\VignetteIndexEntry{hlabud usage examples}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
options(width=80)
library(data.table)
library(dplyr)
library(parameters)
library(pbapply)
library(glue)
library(scales)
library(readr)
library(stringr)
library(magrittr)
library(ggplot2)
library(ggrepel)
library(ggstance)
file_size <- function(x) glue("{fs::file_size(x)}B")
#' @importFrom ggplot2 theme_classic theme element_rect element_line
#' element_blank element_text
#' @importFrom grid unit
theme_kamil <- theme_classic(
base_family = "Arial",
base_size = 16,
base_line_size = 0.3,
base_rect_size = 0.3
) +
theme(
panel.spacing = unit(2, "lines"),
panel.border = element_rect(linewidth = 0.5, fill = NA),
axis.ticks = element_line(linewidth = 0.4),
axis.line = element_blank(),
strip.background = element_blank(),
plot.title = element_text(size = 16),
plot.title.position = "plot",
plot.subtitle = element_text(size = 16),
plot.caption = element_text(size = 16),
strip.text = element_text(size = 16),
legend.text = element_text(size = 16),
legend.title = element_text(size = 16),
axis.text = element_text(size = 16),
axis.title = element_text(size = 16)
)
theme_set(theme_kamil)
options(
ggplot2.discrete.colour = c(
"#666666", "#E69F00", "#56B4E9", "#009E73", "#0072B2", "#D55E00", "#CC79A7"
),
ggplot2.discrete.fill = c(
"#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#666666"
)
)
mpn65 <- c(
'#ff0029', '#377eb8', '#66a61e', '#984ea3', '#00d2d5', '#ff7f00', '#af8d00',
'#7f80cd', '#b3e900', '#c42e60', '#a65628', '#f781bf', '#8dd3c7', '#bebada',
'#fb8072', '#80b1d3', '#fdb462', '#fccde5', '#bc80bd', '#ffed6f', '#c4eaff',
'#cf8c00', '#1b9e77', '#d95f02', '#e7298a', '#e6ab02', '#a6761d', '#0097ff',
'#00d067', '#000000', '#252525', '#525252', '#737373', '#969696', '#bdbdbd',
'#f43600', '#4ba93b', '#5779bb', '#927acc', '#97ee3f', '#bf3947', '#9f5b00',
'#f48758', '#8caed6', '#f2b94f', '#eff26e', '#e43872', '#d9b100', '#9d7a00',
'#698cff', '#d9d9d9', '#00d27e', '#d06800', '#009f82', '#c49200', '#cbe8ff',
'#fecddf', '#c27eb6', '#8cd2ce', '#c4b8d9', '#f883b0', '#a49100', '#f48800',
'#27d0df', '#a04a9b'
)
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
## Introduction
Kamil Slowikowski
`r Sys.Date()`
The [IMGTHLA] provides a Github repo with alignments of amino acid sequences and nucleotide sequences for thousands of alleles of the HLA genes.
The IMGTHLA alignments define the official numbering scheme, and they provide some explanations for the conventions on their [help page].
The [hlabud] R package provides easy access to the alignment data, and hlabud follows the official numbering scheme.
The examples below should help beginners to visualize and understand how the conventions work.
[hlabud]: https://github.com/slowkow/hlabud
[IMGTHLA]: https://github.com/ANHIG/IMGTHLA
[help page]: https://www.ebi.ac.uk/ipd/mhc/alignment/help/
# Alignment files on the IMGTHLA Github page
The IMGTHLA Github page provides [a folder](https://github.com/ANHIG/IMGTHLA/tree/v3.56.0-alpha/alignments) with alignment files.
![](https://github.com/ANHIG/IMGTHLA/assets/209714/b0eb1c09-6a49-46f3-b8d9-21db46a82059)
For the examples in this vignette, we will use the *HLA-DRB1* gene.
For DRB1, we can find three separate files:
- https://raw.githubusercontent.com/ANHIG/IMGTHLA/v3.56.0-alpha/alignments/DRB1_gen.txt
- https://raw.githubusercontent.com/ANHIG/IMGTHLA/v3.56.0-alpha/alignments/DRB1_nuc.txt
- https://raw.githubusercontent.com/ANHIG/IMGTHLA/v3.56.0-alpha/alignments/DRB1_prot.txt
The files contain different information:
- `gen` contains genomic DNA sequences.
- `nuc` contains nucleotide coding sequences (CDS).
- `prot` contains protein sequences (amino acids).
Let's consider the `DRB1_prot.txt` file. What does the file look like?
![](https://github.com/ANHIG/IMGTHLA/assets/209714/5ee93946-a31d-4072-883d-daa92bf448a7)
It is a plain text file with a header and sequence alignments.
In the alignment, each line represents one sequence (or allele), and each line has 100 residues.
The first 100 residues for all alleles are shown in the first block.
Then, the next block has the next 100 residues for all of the alleles, and so on.
## Numbering conventions
Here are the conventions used for alignments (copied from EBI):
> - The entry for each allele is displayed in respect to the reference sequences.
> - Where identity to the reference sequence is present the base will be displayed as a hyphen (`-`).
> - Non-identity to the reference sequence is shown by displaying the appropriate base at that position.
> - Where an insertion or deletion has occurred this will be represented by a period (`.`).
> - If the sequence is unknown at any point in the alignment, this will be represented by an asterisk (`*`).
> - In protein alignments for null alleles, the 'Stop' codons will be represented by a hash (`X`).
> - In protein alignments, sequence following the termination codon, will not be marked and will appear blank.
> - These conventions are used for both nucleotide and protein alignments.
That's a lot of information! Let's try to work through an example to illustrate how this works.
The first sequence in the alignment is the **reference sequence**.
The position numbering is relative to the reference sequence.
That means deletions (`.`) in the reference sequence are not numbered.
Notice below that the numbering starts with negative numbers. The [help page] clarifies:
> Protein Sequence Numbering
>
> - For amino acid-based systems, the start codon of the mature protein is labeled codon 1.
> - The codon 5' to this is numbered -1.
> - All numbering is based on the reference sequence.
There is no amino acid with the number 0.
## Numbering indels
The alignment below shows that 100 residues are displayed in chunks of 10:
![](https://github.com/ANHIG/IMGTHLA/assets/209714/0e9d425b-a6fe-4a3a-9aa2-285698087f26)
The numbering convention says that indels in the reference sequence are not numbered.
To clarify this point, I manually added additional numbers (11, 21, 30, 39, 49, 59) to the alignment below:
![](https://github.com/ANHIG/IMGTHLA/assets/209714/64666703-0ab2-4280-a776-a8c39195d209)
Notice that when we move from the first chunk `GDTRPRFLWQ` to the next chunk `LKFECHFFNG` we simply add 10 to 1 to get 11 for the number of the `L` amino acid.
Then, as we move on to `TERVR.LLER`, we add 10 to 11 to get 21 for the `T` amino acid.
However, as we move on to `CIYNQEE.SV` the rule of "add 10" does not work.
Instead of labeling `C` as position 31, we label it position 30. Why?
The reason why `C` is 30, and not 31, is because there is an indel (or gap) in the reference sequence at position 25_26 (notice the `.` in `R.L`).
The convention is that deletions in the reference sequence are not numbered.
Let's take a closer look at this data with hlabud.
Here are the first few amino acid positions for the first 4 sequences:
```{r}
library(hlabud)
a <- hla_alignments("DRB1", release = "3.56.0")
seqs <- substr(a$sequences[1:4], 30, 89)
str_replace_all(seqs, "(\\S{10})", "\\1 ")
```
This is how hlabud numbers the positions that we are focusing on in this example:
```{r}
colnames(a$alleles)[50:70]
```
If hlabud is using the correct numbering, then we should see:
- `T` at position 21
- `C` at position 30
```{r}
a$alleles[1,"21"]
a$alleles[1,"30"]
```
What do we see at positions 25, 26, and 25\_26?
Here is the alignment file:
![](https://github.com/ANHIG/IMGTHLA/assets/209714/7a779b7f-2d2c-4817-95b7-986c2c70171c)
And here is the result from hlabud:
```{r}
a$alleles[1,"25"]
a$alleles[1,"26"]
a$alleles[1,"25_26"]
```
So, we can see that the deletion between positions 25 and 26 is not numbered like the other residues.
Instead, it gets a special label (25\_26) that consists of the positions flanking the indel (25 and 26).
What alleles do we observe at position 25_26?
```{r}
table(a$alleles[,"25_26"])
```
There are three possibilities at position 25\_26:
- `.` indicates a deletion of 1 amino acid (or the absence of an amino at this position)
- `*` indicates that the sequence is unknown at this position
- `W` indicates tryptophan at this position
I hope this example helps to explain the numbering of indels.
If you notice any discrepancy between hlabud and IMGT, please [report it](https://github.com/slowkow/hlabud/issues).