Mitokon_creation_of_region_files
Mitokon coverage file.
- Create a directory here: ~/Home/workspace/Coverage
- Download the file Mitokondriesykdommer_v2_mart_export_tskodje_240914.txt from Tove (gene panel for coverage analysis, 24/09/2014) into the newly created folder.
- Follow the instructions from: https://github.com/genevar/amgdoc/blob/dev/SOP/Creation_of_region_files/SOP_001_Regionfiles.md
This standard operating procedure (SOP) explains how to create the region files that will be used in the variant calling pipeline.
Responsible person: A bioinformatician is responsible for running the scripts that turn the input data into region files. The input data can be downloaded from the website or prepared by a molecular biologist or bioinformatician; whoever provides the data takes care of that part.
Tools: Ensembl BioMart, the in-house scripts that create region files, a terminal, and optionally Emacs.
- An export file from Ensembl containing the following data for the selected transcripts: chromosome, EnsemblGeneStart, EnsemblGeneEnd, HUGO geneSymbol, refseq, strand, band, EnsemblGeneID, EnsemblTranscriptID, transcriptionStart, transcriptionEnd, entrezID, EntrezGene ID, Ensembl Protein ID, Kromosomlokalisering.
- A refGene file from UCSC containing all available columns. The most recent refGene file in the common datarepo should be used.
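A quick header check can catch a wrong delimiter or a missing column before any script is run. The following is a minimal sketch only: the helper name `missing_columns` is hypothetical, and the column spellings are copied from the list above, so they are an assumption about what the real BioMart header looks like.

```python
# Column names copied from the list above. ASSUMPTION: the real BioMart
# header may spell them differently, so adjust this list to the actual file.
EXPECTED_COLUMNS = [
    "chromosome", "EnsemblGeneStart", "EnsemblGeneEnd", "HUGO geneSymbol",
    "refseq", "strand", "band", "EnsemblGeneID", "EnsemblTranscriptID",
    "transcriptionStart", "transcriptionEnd", "entrezID",
]


def missing_columns(header_line, expected=EXPECTED_COLUMNS):
    """Return the expected column names absent from a tab-separated header line."""
    present = {name.strip() for name in header_line.rstrip("\r\n").split("\t")}
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    # Illustrative header only, not taken from a real export.
    header = "chromosome\tEnsemblGeneStart\tEnsemblGeneEnd\trefseq\n"
    print(missing_columns(header))
```

Because the header is split on tabs only, a comma-separated file reports every expected column as missing, which doubles as a check that the file is tab-separated.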
- Copy the original file to a CSV file:

      cp Mitokondriesykdommer_v2_mart_export_tskodje_240914.txt Mitokondriesykdommer_v2_mart_export_tskodje_240914.csv
- Format the Ensembl biomart file with Unix newlines. Load the MS Excel file into Excel and remove all rows and columns that do not contain the primary data (header and transcript data). Save as a tab-separated text file. Done!
  If you are working on a Mac, this writes Mac carriage returns (^M) to the file instead of newline characters (\n). If so, open the text file in Emacs and run `M-x set-buffer-file-coding-system` with the value `utf-8-unix` (keyboard shortcut: Ctrl-x Enter f). This gives the file newlines instead of CRs. Save the file (Ctrl-x Ctrl-s). If you are not using Emacs, the Unix tr command can be used instead. In a terminal, type:

      tr '\r' '\n' < Mitokondriesykdommer_v2_mart_export_tskodje_240914.csv > Mitokondriesykdommer_v2_mart_export_tskodje_240914_corrected.csv
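As an alternative to Emacs or `tr`, the same conversion can be sketched in Python. This is an illustration rather than part of the SOP; the function names `normalize_newlines` and `convert_file` are hypothetical. Unlike `tr`, which turns each `\r` of a Windows `\r\n` pair into an extra newline, this version collapses `\r\n` into a single `\n`.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert Windows (CRLF) and classic Mac (CR) line endings to Unix (LF)."""
    # Handle CRLF first so that a Windows line ending becomes one LF, not two.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read src_path as bytes, normalise its line endings, and write dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(normalize_newlines(data))
```

Working on bytes avoids any guess about the text encoding of the export.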
- Create an empty HGMD.bed file. ~~This file is necessary for the ??? script to work.~~

      touch HGMD.bed
- Create region files from the Ensembl biomart file. Make sure that the file is tab-separated and NOT comma-separated. The bioinformatician runs the script `make_region_files.py` like this:

      python /Users/yvans/Home/workspace/amg/scripts/make_region_files.py \
        --biomartfile=./Mitokondriesykdommer_v2_mart_export_tskodje_240914_corrected_minuslast3col_corrected.csv \
        --refgene=/Volumes/Analysis/dataDistro_r01_d01_LocalCopy/b37/funcAnnot/refSeq/refGene_131119.tab \
        --outputdir=./ \
        --genepanel=Mitokon_test \
        --version=01 \
        --dict=/Volumes/Analysis/dataDistro_r01_d01_LocalCopy/b37/genomic/gatkBundle_2.5/human_g1k_v37_decoy.dict

  The options are:
  - `--biomartfile`: corrected export file from Ensembl
  - `--refgene`: RefSeq export file from UCSC
  - `--outputdir`: where to save the region files
  - `--genepanel`: name of the gene package
  - `--version`: version number
  - `--dict`: dictionary index from the reference genome
The biomart file is the file from Ensembl, whereas the refgene file is the standard refGene table from the UCSC Genome Browser. The latest file in the common data repo should be used. The versioning starts at 01 and proceeds as 02, 03, and so on each time the gene panel is updated. The genomeDict is the file from the common data repo that defines a sorting order.
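The version scheme is simple enough to express in a few lines. The sketch below only illustrates the zero-padded 01, 02, 03 numbering; the helper name `next_version` is hypothetical and is not part of `make_region_files.py`.

```python
def next_version(current: str) -> str:
    """Return the next zero-padded version string, for example "01" -> "02"."""
    number = int(current) + 1
    # Keep at least two digits, matching the 01, 02, 03 scheme.
    width = max(2, len(current))
    return f"{number:0{width}d}"


if __name__ == "__main__":
    print(next_version("01"))
```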
- Create INI files from the BED region files (can be skipped). ~~For each BED file produced, run the following script:~~
  ~~`bedToInterval.bash genome.dict bedfile.bed`~~
  ~~This command runs the script for every BED file and saves the output to errMake_INI_files:~~
  ~~`for i in *bed; do echo "for file $i"; echo; /Users/yvans/Home/workspace/amg/repos/bedToInterval.bash /Volumes/Analysis/dataDistro_r01_d01_LocalCopy/b37/genomic/gatkBundle_2.5/human_g1k_v37_decoy.dict "$i"; echo; echo; done &> errMake_INI_files`~~
- Place files in the gene panel directory. All files produced, including the original Ensembl biomart file and the HGMD.bed file, are put on GitHub under the genepanels directory, in their own subdirectory specific to this version of the gene panel. The directory is named in the same manner as the files, but without the endings (.codingExons, .slopX, .bed, and so on). Example: Bindevev_OUS_medGen_v01_b37
  Name the Ensembl file Ensembl_biomart_export.xls. (Maybe it is good to keep the original file name.)
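The directory naming rule can be sketched as follows. The helper name `panel_directory_name` is hypothetical, and the code assumes that the panel name itself contains no dots, so that everything after the first dot is an ending.

```python
from pathlib import Path


def panel_directory_name(region_file: str) -> str:
    """Derive the gene panel directory name from a region file name."""
    # ASSUMPTION: the panel name contains no dots, so everything after the
    # first dot is an ending such as .codingExons.slop2.bed.
    return Path(region_file).name.split(".", 1)[0]


if __name__ == "__main__":
    print(panel_directory_name("Bindevev_OUS_medGen_v01_b37.codingExons.slop2.bed"))
```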
- Create a soft link to the coverage regions file. Create a link to the BED file that contains the regions used for calculating coverage. This is typically the file with slop 2. For example (for gene panel Bindevev):

      ln -s Bindevev_OUS_medGen_v01_b37.codingExons.slop2.bed coverageRegions.bed
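If the link needs to be created from a script rather than by hand, something like the sketch below could work on a Unix-like system. The function name `link_coverage_regions` and the glob pattern are assumptions based on the example file name above, not part of the in-house tooling.

```python
import os
from pathlib import Path


def link_coverage_regions(panel_dir, slop=2, link_name="coverageRegions.bed"):
    """Create coverageRegions.bed as a relative symlink to the slop BED file."""
    panel_dir = Path(panel_dir)
    # ASSUMPTION: region files are named <panel>.codingExons.slop<N>.bed.
    matches = sorted(panel_dir.glob(f"*.codingExons.slop{slop}.bed"))
    if len(matches) != 1:
        raise ValueError(f"expected exactly one slop{slop} BED file, found {len(matches)}")
    link = panel_dir / link_name
    # Replace a stale link, but never delete a regular file with that name:
    # os.symlink raises FileExistsError in that case.
    if link.is_symlink():
        link.unlink()
    # Relative target, like running `ln -s <file> coverageRegions.bed` inside the directory.
    os.symlink(matches[0].name, link)
    return link
```

A relative target keeps the link valid when the gene panel directory is cloned or moved elsewhere.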
- Create a README.md file that briefly describes what has been done to produce the gene panel files. Among other things, the file should say which refGene version was used to produce the region files, and state explicitly whether any duplicate refseq names were found in refGene when running the `make_region_files.py` script.
- If there are duplicate refseq names in refGene for any of the selected refseqs, then a manual inspection of each matching record in refGene is necessary to ensure that the correct refseq is selected.
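To find the records that need manual inspection, the refGene table can be scanned for selected refseq names that occur more than once. This is a sketch under assumptions: it assumes the standard UCSC refGene layout, in which the second tab-separated column holds the RefSeq accession, and the function name `find_duplicate_refseqs` is hypothetical. It does not reproduce what `make_region_files.py` does internally.

```python
from collections import defaultdict

# ASSUMPTION: standard UCSC refGene layout, where the first column is "bin"
# and the second column (index 1) is the RefSeq accession ("name").
NAME_COLUMN = 1


def find_duplicate_refseqs(refgene_lines, selected):
    """Map each selected RefSeq name to its refGene rows when it occurs more than once."""
    wanted = set(selected)
    rows = defaultdict(list)
    for line in refgene_lines:
        # Skip blank lines and a commented header, if present.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) > NAME_COLUMN and fields[NAME_COLUMN] in wanted:
            rows[fields[NAME_COLUMN]].append(fields)
    return {name: hits for name, hits in rows.items() if len(hits) > 1}
```

The function accepts any iterable of lines, so an open file handle on the refGene file can be passed directly; the returned rows are the records to inspect by hand.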
- If no HGMD variants are available for a gene panel, an empty HGMD.bed file must be placed in the gene panel directory.
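A scripted equivalent of `touch HGMD.bed` could look like the sketch below; the helper name `ensure_hgmd_bed` is hypothetical. It creates the empty file only when none exists, so a HGMD.bed that already holds variants is left as it is.

```python
from pathlib import Path


def ensure_hgmd_bed(panel_dir):
    """Create an empty HGMD.bed in panel_dir unless the file already exists."""
    path = Path(panel_dir) / "HGMD.bed"
    # touch() creates an empty file and leaves the content of an existing file untouched.
    path.touch(exist_ok=True)
    return path
```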
- See “4.6 HGMD Mutasjon” from SOP “Bearbeidelse av kandidategenlister for bioinformatisk analyse” for details about using HGMD MutationMart to find the genomic coordinates for variants registered in HGMD Professional.
- More details about the region files are also given on the wiki page https://github.com/genevar/amg/wiki/Bed-files-creation