# Annotate bim files from imputed and exome data using ANNOVAR

## Step 1. Creation of a bim file for the imputed data 

In this case I've decided to use the files in this folder `UKBiobank_Yale_transfer/ukb39554_imputeddataset` named `ukb_mfi_chr{1..22}_v3.txt`

The header for these files is described [here](https://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=531) and is as follows:

NOTE: remember that sometimes the minor allele is not the alternative allele 

* Alternate_id
* RS_id
* Position
* Allele1
* Allele2
* MAF
* Minor Allele
* Info score

From these files we need to create files in the bim format:

.bim (PLINK extended MAP file)

Extended variant information file accompanying a .bed binary genotype table.

A text file with no header line, and one line per variant with the following six fields:

    Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
    Variant identifier
    Position in morgans or centimorgans (safe to use dummy value of '0')
    Base-pair coordinate (1-based; limited to 2^31-2)
    Allele 1 (corresponding to clear bits in .bed; usually minor) 
    Allele 2 (corresponding to set bits in .bed; usually major)

This bim file is then processed by `annovar.ipynb` to create the avinput file, which has the following format.

On each line, the first five space- or tab- delimited columns represent:

    chromosome
    start position
    end position
    the reference nucleotides
    the observed nucleotides
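
As an illustration of how the six bim columns map onto these five avinput columns, here is a rough sketch. It is not the conversion performed by `annovar.ipynb`, which may normalize indels differently; it assumes bim column 5 holds the observed (alternative) allele and column 6 the reference allele, and it computes the end position as start plus the reference length minus one. The function name `bim_to_avinput` is hypothetical.

```shell
# Hypothetical helper: print chromosome, start, end, reference and observed
# alleles for each line of a bim file.
# Assumption: bim column 5 = observed (alternative), column 6 = reference.
bim_to_avinput() {
    awk 'BEGIN { OFS = "\t" }
         { print $1, $4, $4 + length($6) - 1, $6, $5 }' "$1"
}

# Example call:
# bim_to_avinput ukb_mfi_chr1_v3.bim | head
```

For a single-nucleotide variant the start and end positions coincide; for a deletion written with the full reference sequence the end position extends over the reference allele.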

### Create the bim file from the `ukb_mfi_chr{1..22}_v3.txt` files


In [3]:
cd $HOME/UKBiobank_Yale_transfer/ukb39554_imputeddataset
cat ukb_mfi_chr1_v3.txt | head



1:10177_A_AC	rs367896724	10177	A	AC	0.40079	AC	0.467935
1:10235_T_TA	rs540431307	10235	T	TA	0.000367353	TA	0.214688
1:10352_T_TA	rs201106462	10352	T	TA	0.394625	TA	0.447895
1:10505_A_T	rs548419688	10505	A	T	1.7934e-05	T	0.230206
1:10506_C_G	rs568405545	10506	C	G	1.7934e-05	G	0.230206
1:10511_G_A	rs534229142	10511	G	A	0.00121697	A	0.438272
1:10539_C_A	rs537182016	10539	C	A	0.000484692	A	0.185581
1:10542_C_T	rs572818783	10542	C	T	1.18112e-05	T	0.397842
1:10579_C_A	rs538322974	10579	C	A	4.42114e-06	A	0.578674
1:10616_CCGCCGTTGCAAAGGCGCGCCG_C	1:10616_CCGCCGTTGCAAAGGCGCGCCG_C	10616	CCGCCGTTGCAAAGGCGCGCCG	C	0.00583227	CCGCCGTTGCAAAGGCGCGCCG	0.468098
cat: write error: Broken pipe


In [7]:
# From the ukb_mfi file get the chr, alternate_id, pos, bim allele 1 (the mfi Allele2: alternative, usually minor) and bim allele 2 (the mfi Allele1: reference, usually major)
# The chromosome is everything before the first ":" in the ID, so two-digit chromosomes are kept whole
cat ukb_mfi_chr1_v3.txt | awk ' { split($1,id,":"); gsub("_",":",$1); print id[1], $1, $3, $5, $4 }' | head

1 1:10177:A:AC 10177 AC A
1 1:10235:T:TA 10235 TA T
1 1:10352:T:TA 10352 TA T
1 1:10505:A:T 10505 T A
1 1:10506:C:G 10506 G C
1 1:10511:G:A 10511 A G
1 1:10539:C:A 10539 A C
1 1:10542:C:T 10542 T C
1 1:10579:C:A 10579 A C
1 1:10616:CCGCCGTTGCAAAGGCGCGCCG:C 10616 C CCGCCGTTGCAAAGGCGCGCCG
cat: write error: Broken pipe


In [19]:
# Add the dummy 0 cM column (third field) to complete the bim format so it can be processed with annovar
cat ukb_mfi_chr1_v3.txt | awk ' { split($1,id,":"); gsub("_",":",$1); print id[1], $1, 0, $3, $5, $4 }' | head

1 1:10177:A:AC 0 10177 AC A
1 1:10235:T:TA 0 10235 TA T
1 1:10352:T:TA 0 10352 TA T
1 1:10505:A:T 0 10505 T A
1 1:10506:C:G 0 10506 G C
1 1:10511:G:A 0 10511 A G
1 1:10539:C:A 0 10539 A C
1 1:10542:C:T 0 10542 T C
1 1:10579:C:A 0 10579 A C
1 1:10616:CCGCCGTTGCAAAGGCGCGCCG:C 0 10616 C CCGCCGTTGCAAAGGCGCGCCG
cat: write error: Broken pipe


In [20]:
# For loop to create the bim for each mfi file in this folder
# The chromosome is everything before the first ":" in the ID, so chr10-22 are not truncated to one character

for file in ukb_mfi_chr{1..22}_v3.txt ; do
    awk ' { split($1,id,":"); gsub("_",":",$1); print id[1], $1, 0, $3, $5, $4 }' "$file" \
        > $HOME/UKBiobank/results/ukb39554_imputeddataset_bim_files/${file%.txt}.bim
done
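
Before annotating, it may be worth checking that each generated file really has six fields per line and carries the expected chromosome code in the first column. The helper below is a sketch, not part of the original workflow; `check_bim` and its two arguments (file path, expected chromosome) are names chosen here for illustration.

```shell
# Hypothetical helper: verify that every line of a bim file has six fields
# and that the first field equals the expected chromosome code.
check_bim() {
    awk -v chr="$2" '
        NF != 6 || $1 != chr { bad++ }
        END {
            if (bad) { print "FAIL: " bad " bad lines"; exit 1 }
            print "OK"
        }
    ' "$1"
}

# Example loop over the output folder used above:
# for chr in {1..22}; do
#     check_bim $HOME/UKBiobank/results/ukb39554_imputeddataset_bim_files/ukb_mfi_chr${chr}_v3.bim $chr
# done
```

The function prints `OK` and returns 0 when the file passes, and prints the number of offending lines and returns 1 otherwise, so it can also be used in a conditional.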

## Annotate imputed data bim file

In [17]:
UKBB_PATH=$HOME/UKBiobank
UKBB_yale=$HOME/UKBiobank_Yale_transfer
USER_PATH=$HOME/project
container_lmm=$HOME/containers/lmm.sif
container_marp=$HOME/containers/marp.sif
container_annovar=$HOME/containers/gatk4-annovar.sif
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
annovar_sos=$USER_PATH/bioworkflows/variant-annotation/annovar.ipynb
annovar_sbatch=$USER_PATH/UKBB_GWAS_dev/output/annovar_chr1_22_ukb39554_imputed_data_$(date +"%Y-%m-%d").sbatch
cwd=$UKBB_PATH/results/ukb39554_imputeddataset_bim_files
bim_name=$UKBB_PATH/results/ukb39554_imputeddataset_bim_files/ukb_mfi_chr1_v3.bim
humandb=/mnt/mfs/statgen/isabelle/REF/humandb
xref_path=/mnt/mfs/statgen/isabelle/REF/humandb
name_prefix='ukb39554_imputed_allvariants'
build='hg19'
walltime='8h'
mem='15G'

annovar_args=""" annovar 
    --cwd $cwd 
    --bim_name $bim_name 
    --humandb $humandb 
    --xref_path $xref_path
    --job_size 1 
    --name_prefix $name_prefix
    --build $build
    --walltime $walltime
    --mem $mem
    --container_annovar $container_annovar
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $annovar_sos \
    --to-script $annovar_sbatch \
    --args "$annovar_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/annovar_chr1_22_ukb39554_imputed_data2021-08-06.sbatch
INFO: Workflow csg (ID=w8b977be5540ac876) is executed successfully with 1 completed step.
