## VCF_analysis

The **MODIFIER** class covers variants located outside known genes, or inside regions that do not encode protein sequence; these are expected to have the smallest impact on the cell.<br>

The **LOW** class includes variants located inside coding regions that do not alter any amino acid of the corresponding protein (synonymous variants).<br>

Both **MODERATE** and **HIGH** variants result in changes at the protein level, but HIGH variants can affect multiple amino acids at once.<br>
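
To make the four classes concrete, here is a minimal sketch that assigns an impact class to a consequence string. The lookup table `TERM_IMPACT`, the ranking `IMPACT_RANK`, and the helper `most_severe_impact` are illustrative names, not part of any annotation tool; the term-to-class assignments are assumed to follow the Ensembl VEP defaults and cover only the terms seen in this notebook. For combined terms joined with `&`, the sketch assumes the most severe class wins.

```python
# Severity order of the four impact classes (lowest to highest).
IMPACT_RANK = {"MODIFIER": 0, "LOW": 1, "MODERATE": 2, "HIGH": 3}

# Partial lookup table (assumption: mirrors the Ensembl VEP defaults for these terms).
TERM_IMPACT = {
    "stop_gained": "HIGH",
    "frameshift_variant": "HIGH",
    "splice_acceptor_variant": "HIGH",
    "missense_variant": "MODERATE",
    "synonymous_variant": "LOW",
    "splice_region_variant": "LOW",
    "5_prime_UTR_variant": "MODIFIER",
    "3_prime_UTR_variant": "MODIFIER",
    "intron_variant": "MODIFIER",
    "non_coding_transcript_variant": "MODIFIER",
    "non_coding_transcript_exon_variant": "MODIFIER",
    "upstream_gene_variant": "MODIFIER",
    "downstream_gene_variant": "MODIFIER",
}


def most_severe_impact(consequence):
    """Return the highest impact class among the '&'-joined consequence terms."""
    impacts = [TERM_IMPACT[term] for term in consequence.split("&")]
    return max(impacts, key=IMPACT_RANK.__getitem__)


print(most_severe_impact("missense_variant"))
print(most_severe_impact("splice_region_variant&intron_variant"))
```

A term missing from the table raises a `KeyError`, which is deliberate in this sketch: an unknown consequence should be noticed rather than silently classified.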

In [18]:
import pandas as pd

In [19]:
# Show up to 1000 rows when displaying a DataFrame
pd.set_option('display.max_rows', 1000)

# Show up to 50 columns when displaying a DataFrame
pd.set_option('display.max_columns', 50)

In [20]:
# Load the tab-separated variant table
variants = pd.read_table("APD002_T2.consensus.tab", sep="\t")
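
A column named `Unnamed: 0` is what pandas produces when the first header field of a file is empty, which usually means the table was written together with its row names. The sketch below uses a small in-memory stand-in (`demo_tsv`, a made-up excerpt, not the real file) to show how `index_col=0` would turn that column into the index instead. Whether the real file has this layout is an assumption.

```python
import io

import pandas as pd

# In-memory stand-in for a tab-separated file whose header starts with an empty field.
demo_tsv = "\tseqnames\tstart\tAF\n0\t1\t1961657\t0.164\n1\t1\t16461628\t0.156\n"

# Default load: the unnamed first field becomes a regular column.
default_load = pd.read_csv(io.StringIO(demo_tsv), sep="\t")

# With index_col=0 the first field is used as the row index instead.
indexed_load = pd.read_csv(io.StringIO(demo_tsv), sep="\t", index_col=0)

print(list(default_load.columns))
print(list(indexed_load.columns))
```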

In [21]:
variants

Unnamed: 0,seqnames,start,end,width,strand,AD,AD.1,AF,Consequence,IMPACT,SYMBOL,HGVSc,HGVSp,CLIN_SIG
0,1,1961657,1961657,1,*,51,10,0.164,missense_variant,MODERATE,GABRD,ENST00000378585.4:c.1295G>A,ENSP00000367848.4:p.Arg432His,
1,1,16461628,16461628,1,*,27,5,0.156,synonymous_variant,LOW,EPHA2,ENST00000358432.5:c.1485C>T,ENSP00000351209.5:p.Asp495%3D,
2,1,21212735,21212735,1,*,41,7,0.146,stop_gained,HIGH,EIF4G3,ENST00000602326.1:c.2233C>T,ENSP00000473510.1:p.Arg745Ter,
3,1,22816948,22816948,1,*,27,8,0.229,synonymous_variant,LOW,ZBTB40,ENST00000404138.1:c.507C>G,ENSP00000384527.1:p.Gly169%3D,
4,1,25880505,25880505,1,*,40,8,0.167,missense_variant,MODERATE,LDLRAP1,ENST00000374338.4:c.181C>G,ENSP00000363458.4:p.Pro61Ala,
5,1,26688414,26688414,1,*,29,13,0.31,missense_variant,MODERATE,ZNF683,ENST00000374204.1:c.1243T>G,ENSP00000363320.1:p.Phe415Val,
6,1,46501061,46501061,1,*,21,9,0.3,downstream_gene_variant,MODIFIER,PIK3R3,,,
7,1,55088984,55088984,1,*,72,11,0.133,missense_variant,MODERATE,FAM151A,ENST00000302250.2:c.85G>A,ENSP00000306888.2:p.Ala29Thr,
8,1,86591188,86591188,1,*,38,5,0.116,missense_variant,MODERATE,COL24A1,ENST00000370571.2:c.831A>C,ENSP00000359603.2:p.Lys277Asn,
9,1,112846126,112846126,1,*,22,5,0.185,intron_variant&non_coding_transcript_variant,MODIFIER,RP5-965F6.2,ENST00000427290.1:n.167+56858G>A,,


In [22]:
# Distinct consequence types present in the variant table
variants['Consequence'].unique()

array(['missense_variant', 'synonymous_variant', 'stop_gained',
       'downstream_gene_variant',
       'intron_variant&non_coding_transcript_variant',
       'upstream_gene_variant', 'intron_variant',
       'splice_region_variant&5_prime_UTR_variant', '3_prime_UTR_variant',
       'non_coding_transcript_exon_variant', 'splice_acceptor_variant',
       'splice_region_variant&intron_variant', '5_prime_UTR_variant',
       'frameshift_variant'], dtype=object)

In [23]:
# Consequence terms to exclude as non-coding
non_coding_mutations = [
    'intron_variant&non_coding_transcript_variant',
    'splice_region_variant&5_prime_UTR_variant',
    '3_prime_UTR_variant',
    'non_coding_transcript_exon_variant',
    'splice_acceptor_variant',
    'splice_region_variant&intron_variant',
    '5_prime_UTR_variant',
]

In [24]:
non_coding_mutations

['intron_variant&non_coding_transcript_variant',
 'splice_region_variant&5_prime_UTR_variant',
 '3_prime_UTR_variant',
 'non_coding_transcript_exon_variant',
 'splice_acceptor_variant',
 'splice_region_variant&intron_variant',
 '5_prime_UTR_variant']

In [25]:
# Drop non-coding variants: keep rows whose Consequence is not in the exclusion list
coding_variants = variants[~variants['Consequence'].isin(non_coding_mutations)]

In [26]:
coding_variants

Unnamed: 0,seqnames,start,end,width,strand,AD,AD.1,AF,Consequence,IMPACT,SYMBOL,HGVSc,HGVSp,CLIN_SIG
0,1,1961657,1961657,1,*,51,10,0.164,missense_variant,MODERATE,GABRD,ENST00000378585.4:c.1295G>A,ENSP00000367848.4:p.Arg432His,
1,1,16461628,16461628,1,*,27,5,0.156,synonymous_variant,LOW,EPHA2,ENST00000358432.5:c.1485C>T,ENSP00000351209.5:p.Asp495%3D,
2,1,21212735,21212735,1,*,41,7,0.146,stop_gained,HIGH,EIF4G3,ENST00000602326.1:c.2233C>T,ENSP00000473510.1:p.Arg745Ter,
3,1,22816948,22816948,1,*,27,8,0.229,synonymous_variant,LOW,ZBTB40,ENST00000404138.1:c.507C>G,ENSP00000384527.1:p.Gly169%3D,
4,1,25880505,25880505,1,*,40,8,0.167,missense_variant,MODERATE,LDLRAP1,ENST00000374338.4:c.181C>G,ENSP00000363458.4:p.Pro61Ala,
5,1,26688414,26688414,1,*,29,13,0.31,missense_variant,MODERATE,ZNF683,ENST00000374204.1:c.1243T>G,ENSP00000363320.1:p.Phe415Val,
6,1,46501061,46501061,1,*,21,9,0.3,downstream_gene_variant,MODIFIER,PIK3R3,,,
7,1,55088984,55088984,1,*,72,11,0.133,missense_variant,MODERATE,FAM151A,ENST00000302250.2:c.85G>A,ENSP00000306888.2:p.Ala29Thr,
8,1,86591188,86591188,1,*,38,5,0.116,missense_variant,MODERATE,COL24A1,ENST00000370571.2:c.831A>C,ENSP00000359603.2:p.Lys277Asn,
10,1,114193692,114193692,1,*,43,7,0.14,synonymous_variant,LOW,MAGI3,ENST00000307546.9:c.2304C>T,ENSP00000304604.9:p.Arg768%3D,
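
The filter above excludes by exact match against a fixed list, so consequence terms that are absent from the list, such as `downstream_gene_variant` (row 6), remain in `coding_variants`. If a stricter definition is wanted, one option is to keep only rows that contain at least one protein-coding term. The sketch below uses a few rows copied from the table; the set `CODING_TERMS` and the helper `has_coding_term` are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

# A few rows copied from the variant table above.
demo = pd.DataFrame({
    "SYMBOL": ["GABRD", "PIK3R3", "RP5-965F6.2", "EIF4G3"],
    "Consequence": [
        "missense_variant",
        "downstream_gene_variant",
        "intron_variant&non_coding_transcript_variant",
        "stop_gained",
    ],
})

# Assumed set of consequence terms that affect the coding sequence.
CODING_TERMS = {
    "missense_variant",
    "synonymous_variant",
    "stop_gained",
    "frameshift_variant",
}


def has_coding_term(consequence):
    """True if any '&'-joined term of the consequence is a coding term."""
    return any(term in CODING_TERMS for term in consequence.split("&"))


coding_only = demo[demo["Consequence"].apply(has_coding_term)]
print(coding_only["SYMBOL"].tolist())
```

An inclusion list like this fails closed: a consequence term that was not anticipated is dropped rather than kept.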


In [15]:
#non_coding_mutations=['intron_variant&non_coding_transcript_variant','splice_region_variant&5_prime_UTR_variant', '3_prime_UTR_variant','non_coding_transcript_exon_variant','splice_acceptor_variant','splice_region_variant&intron_variant', '5_prime_UTR_variant']
#pattern = '|'.join(non_coding_mutations)
#result = variants.loc[~variants['Consequence'].str.contains(pattern, case=False)]

In [28]:
# Keep only the columns needed for the analysis
selection = coding_variants[['SYMBOL', 'seqnames', 'start', 'end', 'IMPACT', 'CLIN_SIG']]

In [11]:
selection

Unnamed: 0,SYMBOL,seqnames,start,end,IMPACT,CLIN_SIG
0,GABRD,1,1961657,1961657,MODERATE,
1,EPHA2,1,16461628,16461628,LOW,
2,EIF4G3,1,21212735,21212735,HIGH,
3,ZBTB40,1,22816948,22816948,LOW,
4,LDLRAP1,1,25880505,25880505,MODERATE,
5,ZNF683,1,26688414,26688414,MODERATE,
6,PIK3R3,1,46501061,46501061,MODIFIER,
7,FAM151A,1,55088984,55088984,MODERATE,
8,COL24A1,1,86591188,86591188,MODERATE,
10,MAGI3,1,114193692,114193692,LOW,


In [29]:
# Encode the protein-altering classes (MODERATE, HIGH) as 1
select1 = selection.replace(to_replace=["MODERATE", "HIGH"], value=1)

In [30]:
# Encode the remaining classes (LOW, MODIFIER) as 0
select0 = select1.replace(to_replace=["LOW", "MODIFIER"], value=0)

In [31]:
select0

Unnamed: 0,SYMBOL,seqnames,start,end,IMPACT,CLIN_SIG
0,GABRD,1,1961657,1961657,1,
1,EPHA2,1,16461628,16461628,0,
2,EIF4G3,1,21212735,21212735,1,
3,ZBTB40,1,22816948,22816948,0,
4,LDLRAP1,1,25880505,25880505,1,
5,ZNF683,1,26688414,26688414,1,
6,PIK3R3,1,46501061,46501061,0,
7,FAM151A,1,55088984,55088984,1,
8,COL24A1,1,86591188,86591188,1,
10,MAGI3,1,114193692,114193692,0,
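
The two `replace` calls act on every column of the DataFrame. A possible alternative, sketched here on a few rows from the table, is to map only the `IMPACT` column with a single dictionary. The names `impact_to_flag` and `flagged` are illustrative.

```python
import pandas as pd

# A few rows copied from the selection above.
demo = pd.DataFrame({
    "SYMBOL": ["GABRD", "EPHA2", "EIF4G3", "PIK3R3"],
    "IMPACT": ["MODERATE", "LOW", "HIGH", "MODIFIER"],
})

# 1 = protein-altering (MODERATE, HIGH); 0 = LOW or MODIFIER.
impact_to_flag = {"MODERATE": 1, "HIGH": 1, "LOW": 0, "MODIFIER": 0}

# Map only the IMPACT column; all other columns are left untouched.
flagged = demo.assign(IMPACT=demo["IMPACT"].map(impact_to_flag))
print(flagged["IMPACT"].tolist())
```

With `Series.map`, any value that is missing from the dictionary becomes `NaN`, so an unexpected impact class shows up in the result instead of passing through unchanged.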


## Useful links

https://en.wikipedia.org/wiki/SNP_annotation <br>
https://blog.goldenhelix.com/the-sate-of-variant-annotation-a-comparison-of-annovar-snpeff-and-vep/ <br>
    
https://m.ensembl.org/info/genome/variation/prediction/protein_function.html <br>
https://m.ensembl.org/info/genome/variation/prediction/predicted_data.html <br>
    
https://www.targetvalidation.org/variants <br>
    
### Allelic depth (AD)
AD is the allele-specific depth. For example, if the reference allele is 'A', the alternate allele is 'G', and AD is '6,9', then 6 reads support A and 9 reads support G.
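
The AD counts relate directly to the allele fraction. In the variant table, the `AD` and `AD.1` columns appear to hold the reference and alternate read counts: for the first row, 10 / (51 + 10) ≈ 0.164 matches the reported `AF`. The sketch below computes this fraction from an AD string. The helper name `alt_allele_fraction` is illustrative, and a biallelic site with exactly two comma-separated counts is assumed.

```python
def alt_allele_fraction(ad):
    """Fraction of reads supporting the alternate allele, from an AD string 'ref,alt'."""
    ref_reads, alt_reads = (int(count) for count in ad.split(","))
    total = ref_reads + alt_reads
    # Avoid division by zero when no informative reads cover the site.
    return alt_reads / total if total else float("nan")


print(alt_allele_fraction("6,9"))               # 9 of 15 reads support the alternate allele
print(round(alt_allele_fraction("51,10"), 3))   # first row of the variant table
```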

AD and DP: allele depth and depth of coverage.

These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site.

AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller’s filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another.

DP is the filtered depth at the sample level, i.e. the number of reads at the site that passed the variant caller's filters. Unlike AD, it is a single total per sample rather than one count per allele.