VEP TSV file format updated so not recognized by parseVEP #540

Closed
zhuchcn opened this issue Jul 27, 2022 · 0 comments · Fixed by #541
zhuchcn commented Jul 27, 2022

These VEP files seem to differ from the ones we used to deal with. The old VEP files that we worked with for CCLE and CPCGENE have 14 columns, with the last column, Extra, holding key-value pairs joined by '='. In these VEP files, the last column appears to be split into additional columns: IMPACT, DISTANCE, STRAND, FLAGS, SOURCE, and GENCODEv34.

The new VEP file looks like this:

#Uploaded_variation  Location      Allele  Gene                Feature            Feature_type  Consequence                                   cDNA_position  CDS_position  Protein_position  Amino_acids  Codons  Existing_variation  IMPACT    DISTANCE  STRAND  FLAGS  SOURCE      GENCODEv34
chr1_701618_C/T      chr1:701618   T       ENSG00000230021.10  ENST00000419394.2  Transcript    intron_variant,non_coding_transcript_variant  -              -             -                 -            -       -                   MODIFIER  -         -1      -      GENCODEv34  -
chr1_701618_C/T      chr1:701618   T       ENSG00000230021.10  ENST00000440200.5  Transcript    intron_variant,non_coding_transcript_variant  -              -             -                 -            -       -                   MODIFIER  -         -1      -      GENCODEv34  -
chr1_701618_C/T      chr1:701618   T       ENSG00000230021.10  ENST00000634337.2  Transcript    intron_variant,non_coding_transcript_variant  -              -             -                 -            -       -                   MODIFIER  -         -1      -      GENCODEv34  -
chr1_701618_C/T      chr1:701618   T       ENSG00000230021.10  ENST00000635509.2  Transcript    intron_variant,non_coding_transcript_variant  -              -             -                 -            -       -                   MODIFIER  -         -1      -      GENCODEv34  -
chr1_701618_C/T      chr1:701618   T       ENSG00000230021.10  ENST00000648019.1  Transcript    intron_variant,non_coding_transcript_variant  -              -             -                 -            -       -                   MODIFIER  -         -1      -      GENCODEv34  -
chr1_872633_G/C      chr1:872633   C       ENSG00000230368.2   ENST00000427857.1  Transcript    intron_variant,non_coding_transcript_variant  -              -             -                 -            -       -                   MODIFIER  -         -1      -      GENCODEv34  -
chr1_872633_G/C      chr1:872633   C       ENSG00000230368.2   ENST00000446136.1  Transcript    intron_variant,non_coding_transcript_variant  -              -             -                 -            -       -                   MODIFIER  -         -1      -      GENCODEv34  -
chr1_5935153_C/T     chr1:5935153  T       ENSG00000131697.18  ENST00000378156.9  Transcript    intron_variant                                -              -             -                 -            -       -                   MODIFIER  -         -1      -      GENCODEv34  -
chr1_5935153_C/T     chr1:5935153  T       ENSG00000131697.18  ENST00000378169.7  Transcript    intron_variant,NMD_transcript_variant         -              -             -                 -            -       -                   MODIFIER  -         -1      -      GENCODEv34  -

And this is a VEP file from CCLE (path: /hot/project/algorithm/moPepGen/CCLE/processed/mutation/vep/ACH-000001_vep.txt):

#Uploaded_variation  Location    Allele  Gene                Feature             Feature_type  Consequence                         cDNA_position  CDS_position  Protein_position  Amino_acids  Codons   Existing_variation  Extra
1_12299290_C/A       1:12299290  A       ENSG00000048707.15  ENST00000011700.10  Transcript    stop_gained                         2590           2591          864               S/*          tCa/tAa  -                   IMPACT=HIGH;STRAND=1;SOURCE=GENCODEv34
1_12299290_C/A       1:12299290  A       ENSG00000048707     ENST00000011700     Transcript    stop_gained                         2590           2591          864               S/*          tCa/tAa  -                   IMPACT=HIGH;STRAND=1;FLAGS=cds_start_NF
1_12299290_C/A       1:12299290  A       ENSG00000048707.15  ENST00000460333.5   Transcript    non_coding_transcript_exon_variant  10             -             -                 -            -        -                   IMPACT=MODIFIER;STRAND=1;SOURCE=GENCODEv34
1_12299290_C/A       1:12299290  A       ENSG00000048707     ENST00000460333     Transcript    non_coding_transcript_exon_variant  10             -             -                 -            -        -                   IMPACT=MODIFIER;STRAND=1
1_12299290_C/A       1:12299290  A       ENSG00000048707.15  ENST00000613099.4   Transcript    stop_gained                         6252           6122          2041              S/*          tCa/tAa  -                   IMPACT=HIGH;STRAND=1;SOURCE=GENCODEv34
1_12299290_C/A       1:12299290  A       ENSG00000048707     ENST00000613099     Transcript    stop_gained                         6252           6122          2041              S/*          tCa/tAa  -                   IMPACT=HIGH;STRAND=1
1_12299290_C/A       1:12299290  A       ENSG00000048707.15  ENST00000620676.6   Transcript    stop_gained                         6289           6122          2041              S/*          tCa/tAa  -                   IMPACT=HIGH;STRAND=1;SOURCE=GENCODEv34
1_12299290_C/A       1:12299290  A       ENSG00000048707     ENST00000620676     Transcript    stop_gained                         6289           6122          2041              S/*          tCa/tAa  -                   IMPACT=HIGH;STRAND=1
1_12299290_C/A       1:12299290  A       ENSG00000048707.15  ENST00000646917.1   Transcript    stop_gained,NMD_transcript_variant  790            791           264               S/*          tCa/tAa  -                   IMPACT=HIGH;STRAND=1;SOURCE=GENCODEv34
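
To make the difference between the two layouts concrete, here is a minimal sketch of how the old Extra column maps onto the keys that now appear as separate columns. The helper name `parse_extra` is hypothetical and is not part of moPepGen; it assumes entries are separated by ';' and that each entry is a `KEY=VALUE` pair, as in the CCLE rows above.

```python
from typing import Dict


def parse_extra(extra: str) -> Dict[str, str]:
    """Split an old-style Extra field into a key -> value mapping.

    Hypothetical helper for illustration only. An empty field or the
    '-' placeholder yields an empty mapping.
    """
    if extra in ("", "-"):
        return {}
    pairs: Dict[str, str] = {}
    for item in extra.split(";"):
        # partition() keeps everything after the first '=' as the value
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs


if __name__ == "__main__":
    print(parse_extra("IMPACT=HIGH;STRAND=1;FLAGS=cds_start_NF"))
```

In the new layout the same keys (IMPACT, STRAND, FLAGS, SOURCE) arrive as their own tab-separated columns, so no such splitting would be needed.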

The last column isn't used for anything at the moment. Maybe we can just modify moPepGen to read only columns 0-12, which are the same in both layouts. @lydiayliu what do you think?
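
A rough sketch of that idea, assuming tab-separated input and the 13 shared column names taken from the headers above. This is not moPepGen's actual parseVEP code; `CORE_COLUMNS` and `iter_vep_core_fields` are hypothetical names used only for illustration.

```python
import io
from typing import Dict, Iterator, TextIO

# The 13 columns (indices 0-12) common to the old and new layouts.
CORE_COLUMNS = [
    "Uploaded_variation", "Location", "Allele", "Gene", "Feature",
    "Feature_type", "Consequence", "cDNA_position", "CDS_position",
    "Protein_position", "Amino_acids", "Codons", "Existing_variation",
]


def iter_vep_core_fields(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield the core fields of each record, ignoring any trailing columns."""
    for line in handle:
        line = line.rstrip("\r\n")
        # Skip blank lines and header/comment lines starting with '#'
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < len(CORE_COLUMNS):
            raise ValueError(
                f"Expected at least {len(CORE_COLUMNS)} columns, got {len(fields)}"
            )
        # Columns 13 and beyond (Extra, or IMPACT/DISTANCE/...) are dropped
        yield dict(zip(CORE_COLUMNS, fields[:len(CORE_COLUMNS)]))


if __name__ == "__main__":
    old_row = "\t".join([
        "1_12299290_C/A", "1:12299290", "A", "ENSG00000048707.15",
        "ENST00000011700.10", "Transcript", "stop_gained", "2590", "2591",
        "864", "S/*", "tCa/tAa", "-", "IMPACT=HIGH;STRAND=1;SOURCE=GENCODEv34",
    ])
    for record in iter_vep_core_fields(io.StringIO(old_row + "\n")):
        print(record["Feature"], record["Consequence"])
```

Because only the first 13 fields are read, a 14-column file with Extra and a 19-column file with the split columns would both produce identical record shapes.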

Originally posted by @zhuchcn in https://github.com/uclahs-cds/project-HNSC-lymphevolution/issues/9#issuecomment-1197198618
