## Add gene symbols to variant filter table

#### 2019/4/9 Yosuke Tanigawa (ytanigaw@stanford.edu)

Using the data dump from the other notebook (`gene_id_mapping_mygene_mapping_file_generation.ipynb`),
we add one column (`Gene_symbol`) to the variant filter table.


In [1]:
library(tidyverse)
library(data.table)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0       ✔ purrr   0.3.1  
✔ tibble  2.0.1       ✔ dplyr   0.8.0.1
✔ tidyr   0.8.3       ✔ stringr 1.4.0  
✔ readr   1.3.1       ✔ forcats 0.4.0  
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘data.table’

The following objects are masked from ‘package:dplyr’:

    between, first, last

The following object is masked from ‘package:purrr’:

    transpose



In [2]:
# Read the variant ID -> gene symbol mapping produced by the other notebook
variant_id_gene_symbols <- fread(
    cmd=paste0('zcat ', 'variant_id_to_gene_symbol.tsv'),
    header=TRUE, sep='\t'
)

In [3]:
variant_id_gene_symbols %>% head()

ID,Gene,Gene_symbol
rs28659788,ENSG00000237491,AL669831.5
rs116587930,ENSG00000237491,AL669831.5
rs116720794,ENSG00000237491,AL669831.5
rs3131972,ENSG00000240453,RP11-206L10.10
rs12184325,ENSG00000177757,FAM87B
rs3131962,ENSG00000240453,RP11-206L10.10


In [4]:
in_f <- '/oak/stanford/groups/mrivas/private_data/ukbb/variant_filtering/variant_filter_table.old.tsv.gz'


In [5]:
df <- fread(cmd=paste0('zcat ', in_f), sep='\t')

In [6]:
df %>% head()

CHROM,POS,REF,ALT,ID,Gene,Consequence,HGVSp,LoF,LoF_filter,⋯,wcsg_only,bileve_only,filter,missingness,hwe,mcpi,gnomad_af,mgi,mgi_notes,all_filters
1,723307,C,G,rs28659788,ENSG00000237491,intron_variant,,,,⋯,False,True,,1,1,0,,,,2
1,727841,G,A,rs116587930,ENSG00000237491,intron_variant,,,,⋯,False,False,,1,1,0,,,,2
1,729632,C,T,rs116720794,ENSG00000237491,intron_variant,,,,⋯,False,False,,1,1,0,,,,2
1,752721,A,G,rs3131972,ENSG00000240453,intron_variant,,,,⋯,False,False,,0,1,0,,,,1
1,754105,C,T,rs12184325,ENSG00000177757,splice_region_variant,,,,⋯,False,False,,0,1,0,,,,1
1,756604,A,G,rs3131962,ENSG00000240453,upstream_gene_variant,,,,⋯,False,False,,0,0,0,,,,0


In [7]:
# Look up the gene symbol for each variant by its ID
df_joined <- df %>% left_join(
    variant_id_gene_symbols %>% select(ID, Gene_symbol), by='ID'
)


In [8]:
df_joined %>% head()

CHROM,POS,REF,ALT,ID,Gene,Consequence,HGVSp,LoF,LoF_filter,⋯,bileve_only,filter,missingness,hwe,mcpi,gnomad_af,mgi,mgi_notes,all_filters,Gene_symbol
1,723307,C,G,rs28659788,ENSG00000237491,intron_variant,,,,⋯,True,,1,1,0,,,,2,AL669831.5
1,727841,G,A,rs116587930,ENSG00000237491,intron_variant,,,,⋯,False,,1,1,0,,,,2,AL669831.5
1,729632,C,T,rs116720794,ENSG00000237491,intron_variant,,,,⋯,False,,1,1,0,,,,2,AL669831.5
1,752721,A,G,rs3131972,ENSG00000240453,intron_variant,,,,⋯,False,,0,1,0,,,,1,RP11-206L10.10
1,754105,C,T,rs12184325,ENSG00000177757,splice_region_variant,,,,⋯,False,,0,1,0,,,,1,FAM87B
1,756604,A,G,rs3131962,ENSG00000240453,upstream_gene_variant,,,,⋯,False,,0,0,0,,,,0,RP11-206L10.10


In [9]:
df %>% dim() %>% print()
df_joined %>% dim() %>% print()

[1] 784256     30
[1] 784256     31
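
The matching row counts above suggest that no variant ID matched more than one row of the mapping table. A left join emits one output row per match, so a repeated ID in the mapping would add rows. The following is a minimal sketch of that effect and of a duplicate check that could be run before the join. It uses toy data: the names `mapping`, `variants`, and `dup_ids` and all values are hypothetical, not taken from the real tables.

```r
library(dplyr)

# Toy stand-in for variant_id_gene_symbols (illustrative values only)
mapping <- tibble::tibble(
    ID = c('rs1', 'rs2', 'rs2'),
    Gene_symbol = c('GENE_A', 'GENE_B', 'GENE_C')
)

# IDs that occur more than once would multiply rows in a left join
dup_ids <- mapping %>% count(ID) %>% filter(n > 1)
print(dup_ids)

# Toy stand-in for the variant filter table
variants <- tibble::tibble(ID = c('rs1', 'rs2', 'rs3'))
joined <- variants %>% left_join(mapping, by = 'ID')

# rs2 has two mapping rows, so the joined table gains one row
print(nrow(variants))
print(nrow(joined))
```

If `dup_ids` were non-empty on the real mapping, the mapping would need to be deduplicated (or the symbols collapsed per ID) before the join to keep the row count unchanged.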


In [10]:
df_joined %>% 
fwrite(
    '/oak/stanford/groups/mrivas/private_data/ukbb/variant_filtering/variant_filter_table.new.tsv',
    sep='\t'
)

In [12]:
df_joined %>% 
select(ID, Gene, Consequence, Gene_symbol) %>% 
filter(
    str_length(Gene) > 0,
    str_length(Gene_symbol) == 0
)

ID,Gene,Consequence,Gene_symbol
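
The check above only looks for empty strings. A left join fills unmatched rows with `NA`, and `str_length(NA)` is `NA`, which `filter()` treats as a non-match, so variants missing from the mapping would not appear in this result. A possible variant of the check that covers both cases is sketched below on toy data; the table `toy` and its values are hypothetical.

```r
library(dplyr)
library(stringr)

# Toy table: rs2 has a gene ID but no symbol (NA, as after an unmatched join);
# rs3 has neither a gene ID nor a symbol
toy <- tibble::tibble(
    ID = c('rs1', 'rs2', 'rs3'),
    Gene = c('ENSG_A', 'ENSG_B', ''),
    Gene_symbol = c('GENE_A', NA, NA)
)

# Empty-string-only check: silently skips the NA symbol
empty_only <- toy %>%
    filter(str_length(Gene) > 0, str_length(Gene_symbol) == 0)

# Check that treats NA and empty string alike
missing_symbol <- toy %>%
    filter(
        str_length(Gene) > 0,
        is.na(Gene_symbol) | str_length(Gene_symbol) == 0
    )

print(nrow(empty_only))
print(missing_symbol)
```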
