yvan edited this page May 6, 2014 · 6 revisions

SPO

linked to this issue

New procedure for creating gene panel region files for e.g. coverage calculations

  1. Molecular biologists (e.g. Tove) use Ensembl BioMart to derive a list of selected transcripts for the genes in the gene panel. A tab-separated file containing the following columns is given to a bioinformatician: chromosome, ensemblGeneStart, ensemblGeneEnd, geneSymbol, refseq, strand, band, ensemblGeneID, ensemblTranscriptID, ensemblTxStart, ensemblTxend

  2. The bioinformatician runs the script make_region_files.py, which creates three files:

    • transcripts.csv : One record per transcript with identifiers and exons.
    • targetRegions.bed : One record per coding exon (typically), for variant calling.
    • coverageRegions.bed : As targetRegions.bed, but for coverage regions.
  3. The files are put in the reference repository in their own folder specific to this version of the gene panel.

The script needs as input a standard refGene file from UCSC, available for download from the UCSC Table Browser. The script matches the RefSeq identifiers from the BioMart file with the RefSeq records in the refGene file. Multiple refGene records with the same RefSeq name can be candidate matches. The record with the smallest total absolute distance in txStart and txEnd to the Ensembl transcript start and end is chosen. If several records have the same total distance, the first of them is chosen. A warning is given if the distance in txStart and/or txEnd is greater than 200 bp. (This often happens, because the Ensembl transcript start/stop differs from the RefSeq start/stop, while the coding exons typically lie at exactly the same positions.)
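The matching rule described above can be sketched as follows. This is only an illustration and not the actual code of make_region_files.py; the names `RefGeneRecord`, `pick_best_match` and `WARN_DISTANCE` are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class RefGeneRecord(NamedTuple):
    """Hypothetical, reduced view of one refGene row."""
    name: str      # RefSeq identifier
    tx_start: int
    tx_end: int


WARN_DISTANCE = 200  # bp, the warning threshold mentioned in the text


def pick_best_match(
    refseq: str, tx_start: int, tx_end: int, records: List[RefGeneRecord]
) -> Tuple[Optional[RefGeneRecord], bool]:
    """Return (best record, warn) for one BioMart transcript.

    The best record minimises |txStart diff| + |txEnd diff|.
    On ties the first candidate in file order wins.
    """
    candidates = [r for r in records if r.name == refseq]
    if not candidates:
        return None, False

    def distance(r: RefGeneRecord) -> int:
        return abs(r.tx_start - tx_start) + abs(r.tx_end - tx_end)

    # min() returns the first minimal element, which gives the tie rule.
    best = min(candidates, key=distance)
    warn = (
        abs(best.tx_start - tx_start) > WARN_DISTANCE
        or abs(best.tx_end - tx_end) > WARN_DISTANCE
    )
    return best, warn
```

Here `tx_start` and `tx_end` are assumed to be the ensemblTxStart and ensemblTxend values from the BioMart file.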

The transcripts.csv file contains the following tab-separated columns: chromosome, txStart, txEnd, refseqName, 0, strand, geneSymbol, eGeneID, eTranscriptID, exonStarts, exonEnds. The exonStarts and exonEnds contain multiple comma-separated values. This file is the 'main' file describing a gene panel.
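As a rough sketch of how one line of transcripts.csv could be read, assuming the eleven columns listed above in that order: the function name `parse_transcript_line` and the example values are hypothetical, and the tolerance for a trailing comma in the exon lists is an assumption based on how UCSC refGene files usually look.

```python
def parse_transcript_line(line):
    """Parse one tab-separated transcripts.csv line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError("expected 11 columns, got %d" % len(fields))
    (chrom, tx_start, tx_end, refseq, _zero, strand,
     gene, e_gene_id, e_tx_id, exon_starts, exon_ends) = fields

    # Comma-separated coordinate lists; empty items (trailing comma) are skipped.
    starts = [int(x) for x in exon_starts.split(",") if x]
    ends = [int(x) for x in exon_ends.split(",") if x]
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds differ in length")

    return {
        "chromosome": chrom,
        "txStart": int(tx_start),
        "txEnd": int(tx_end),
        "refseqName": refseq,
        "strand": strand,
        "geneSymbol": gene,
        "eGeneID": e_gene_id,
        "eTranscriptID": e_tx_id,
        "exonStarts": starts,
        "exonEnds": ends,
    }


# Made-up example values, not taken from a real gene panel.
example = "\t".join([
    "13", "100", "900", "NM_TEST", "0", "+",
    "GENE1", "ENSG_TEST", "ENST_TEST", "100,400,", "200,900,",
]) + "\n"
record = parse_transcript_line(example)
```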

The other region files are standard BED6 files. The name column follows the convention gene__transcript__exon1, where exon numbering starts at 1 and increases in the 5'->3' direction of the gene (i.e. it is strand-dependent). By default, only coding exons are used, but the numbering counts all exons (coding and noncoding) in the refGene record. The order of appearance of the exons (as opposed to their numbering) is always in the 5'->3' direction of the forward strand (i.e. intervals are ordered left to right on the forward strand). The names in the BED files are crucial for correct aggregation in the coverage calculation scripts.
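The naming and ordering convention can be illustrated with a small sketch. It is not the script's own code: the function `exon_bed_records` is hypothetical, clipping each exon to the cdsStart/cdsEnd of the refGene record is an assumption about how "coding exons" are derived, and which identifier goes in the transcript part of the name is left to the caller.

```python
def exon_bed_records(chrom, strand, gene, transcript,
                     exon_starts, exon_ends, cds_start, cds_end):
    """Build BED6 tuples for the coding part of each exon.

    exon_starts/exon_ends are ordered left to right on the forward
    strand, as in refGene. Numbers follow the 5'->3' direction of the
    gene and count all exons, coding or not.
    """
    n = len(exon_starts)
    records = []
    for i, (start, end) in enumerate(zip(exon_starts, exon_ends)):
        # On the minus strand the right-most exon is exon 1.
        number = i + 1 if strand == "+" else n - i
        # Assumption: keep only the part of the exon inside the CDS.
        c_start = max(start, cds_start)
        c_end = min(end, cds_end)
        if c_start >= c_end:
            continue  # exon is entirely noncoding
        name = "%s__%s__exon%d" % (gene, transcript, number)
        records.append((chrom, c_start, c_end, name, 0, strand))
    return records


# Made-up minus-strand transcript with three exons; the first is noncoding.
minus = exon_bed_records("13", "-", "GENE1", "NM_TEST",
                         [100, 300, 500], [200, 400, 600], 350, 550)
```

In this example the output stays in left-to-right order while the exon numbers run downwards, which is the behaviour the paragraph above describes for genes on the reverse strand.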

By default, the script creates two region files, a coverageRegions.bed file using coding exons and a 'slop' of 2, and a targetRegions.bed file using coding exons and a slop of 20.
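A minimal sketch of what a slop does, assuming it means extending every interval by that many bases on both sides (the same sense as in bedtools slop). The helper `add_slop` is hypothetical, and unlike bedtools it does not clamp at the chromosome end because no chromosome sizes are given here.

```python
def add_slop(records, slop):
    """Extend each BED6 tuple by `slop` bp on both sides (start never below 0)."""
    return [
        (chrom, max(0, start - slop), end + slop, name, score, strand)
        for chrom, start, end, name, score, strand in records
    ]


# Made-up coding exon interval.
coding = [("13", 10, 50, "GENE1__NM_TEST__exon1", 0, "+")]
coverage_regions = add_slop(coding, 2)   # slop of 2, as for coverageRegions.bed
target_regions = add_slop(coding, 20)    # slop of 20, as for targetRegions.bed
```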

How I did it

 1. Set $PYTHONPATH to this path ONLY: export PYTHONPATH=/Users/yvans/Home/workspace/amg
 2. Ask The Eidi for the Ensembl file for the BRCA1/2 genes used in diagnostics
 3. Save it in the working directory as mart_export_BRCA_08112013.xls
  3.1. Change the biomart file's format
  3.1.1. Change the format of the file; according to Tony's documentation it should have these columns
   3.1.2. Clear formatting, URLs and the like
   3.1.3. Remove the lines:

         Ensembl Genes 73  
         Homo sapiens genes GRCh37.p12  

   3.1.4. Save as CSV, but set the field delimiter to tabs and use no text delimiter
 4. Get the bed file
  4.1. Go to the UCSC Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables?command=start
  4.2. Select track: RefSeq Genes
  4.3. Select output format: all fields from selected table
  4.4. Press get output
  4.5. Save the file in the working directory as NM_000059_NM_007294_uscs_export
 5. Run the code

> python ../scripts/make_region_files.py --biomartfile mart_export_BRCA_08112013.csv --refgene NM_000059_NM_007294_uscs_export --genepanel HBOC
From 2 refseqNames, found total of 2 records in refGene. 0 names have duplicates.
transcripts.csv and region files written to path '.'

This produces these new files:

  • HBOC_OUS_medGen_v01_b37.codingExons.bed
  • HBOC_OUS_medGen_v01_b37.codingExons.slop
  • HBOC_OUS_medGen_v01_b37.codingExons.slop
  • HBOC_OUS_medGen_v01_b37.codingExons.slop30.bed
  • HBOC_OUS_medGen_v01_b37.codingExons.slop50.bed

06 May 2014

Follow the instructions from this SPO

  1. Download the tables from Veronica's email (new candidate gene list: CDG syndromes) into /Volumes/Nor4GBExFAT/coverage/CDG_mart_export_v2_140414_verbog.xls
  2. Have a look at this issue
