# Generating Genome Feature Tracks

In this notebook, I'll use the [NCBI assembly](https://www.ncbi.nlm.nih.gov/assembly/247141) to create genome feature tracks for *Fundulus heteroclitus*

## 0. Set working directory and variables

In [1]:
!pwd

/Users/yaaminivenkataraman/Documents/killifish-hypoxia-RRBS/code


In [2]:
cd ..

/Users/yaaminivenkataraman/Documents/killifish-hypoxia-RRBS


In [3]:
!mkdir genome-features

mkdir: genome-features: File exists


In [3]:
cd genome-features

/Users/yaaminivenkataraman/Documents/killifish-hypoxia-RRBS/genome-features


In [4]:
!which bedtools

/opt/homebrew/bin/bedtools


In [5]:
bedtoolsDirectory = "/opt/homebrew/bin"

## 1. Download NCBI assembly

I downloaded the GFF from [this link](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/011/125/445/GCF_011125445.2_MU-UCD_Fhet_4.1/).

In [6]:
!curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/011/125/445/GCF_011125445.2_MU-UCD_Fhet_4.1/GCF_011125445.2_MU-UCD_Fhet_4.1_genomic.gff.gz > GCF_011125445.2_MU-UCD_Fhet_4.1_genomic.gff.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16.5M  100 16.5M    0     0  6506k      0  0:00:02  0:00:02 --:--:-- 6506k


In [8]:
!gunzip GCF_011125445.2_MU-UCD_Fhet_4.1_genomic.gff.gz

In [9]:
!head -n100 GCF_011125445.2_MU-UCD_Fhet_4.1_genomic.gff

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build MU-UCD_Fhet_4.1
#!genome-build-accession NCBI_Assembly:GCF_011125445.2
#!annotation-source NCBI Fundulus heteroclitus Annotation Release 102
##sequence-region NC_046361.1 1 44038137
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=8078
NC_046361.1	RefSeq	region	1	44038137	.	+	.	ID=NC_046361.1:1..44038137;Dbxref=taxon:8078;Name=1;chromosome=1;dev-stage=adult;gbkey=Src;genome=chromosome;isolate=FHET01;mol_type=genomic DNA;sex=male;tissue-type=pool
NC_046361.1	Gnomon	gene	65844	103004	.	+	.	ID=gene-LOC105920932;Dbxref=GeneID:105920932;Name=LOC105920932;gbkey=Gene;gene=LOC105920932;gene_biotype=protein_coding
NC_046361.1	Gnomon	mRNA	65844	103004	.	+	.	ID=rna-XM_036139628.1;Parent=gene-LOC105920932;Dbxref=GeneID:105920932,Genbank:XM_036139628.1;Name=XM_036139628.1;gbkey=mRNA;gene=LOC105920932;model_evidence=Supporting evidence includes similarity to: 2 ESTs%2C 5 Proteins%2C and

NC_046361.1	Gnomon	exon	101288	101338	.	+	.	ID=exon-XM_036139679.1-14;Parent=rna-XM_036139679.1;Dbxref=GeneID:105920932,Genbank:XM_036139679.1;gbkey=mRNA;gene=LOC105920932;product=DENN domain-containing protein 2D%2C transcript variant X6;transcript_id=XM_036139679.1
NC_046361.1	Gnomon	exon	101454	101535	.	+	.	ID=exon-XM_036139679.1-15;Parent=rna-XM_036139679.1;Dbxref=GeneID:105920932,Genbank:XM_036139679.1;gbkey=mRNA;gene=LOC105920932;product=DENN domain-containing protein 2D%2C transcript variant X6;transcript_id=XM_036139679.1
NC_046361.1	Gnomon	exon	101608	101871	.	+	.	ID=exon-XM_036139679.1-16;Parent=rna-XM_036139679.1;Dbxref=GeneID:105920932,Genbank:XM_036139679.1;gbkey=mRNA;gene=LOC105920932;product=DENN domain-containing protein 2D%2C transcript variant X6;transcript_id=XM_036139679.1
NC_046361.1	Gnomon	exon	102399	103004	.	+	.	ID=exon-XM_036139679.1-17;Parent=rna-XM_036139679.1;Dbxref=GeneID:105920932,Genbank:XM_036139679.1;gbkey=mRNA;gene=LOC105920932;product=DENN domain-c

## 2. Prepare for feature track creation

Before I pull out feature tracks, I need to know which databases were used for annotation, which features I can expect and how many of them there are, and identify chromosome lengths.

In [10]:
#Database identifiers for extracting features
!cut -f2 GCF_011125445.2_MU-UCD_Fhet_4.1_genomic.gff | sort | uniq -c | tail

   1 ##sequence-region NW_023397469.1 1 25886
   1 ##sequence-region NW_023397470.1 1 25879
   1 ##sequence-region NW_023397471.1 1 25851
1031 ##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=8078
1564 BestRefSeq
  25 BestRefSeq%2CGnomon
1413470 Gnomon
19796 RefSeq
1757 cmsearch
5824 tRNAscan-SE


In [11]:
#Count the number of unique features in the GFF
!cut -f3 GCF_011125445.2_MU-UCD_Fhet_4.1_genomic.gff | sort | uniq -c

   1 #!annotation-source NCBI Fundulus heteroclitus Annotation Release 102
   1 #!genome-build MU-UCD_Fhet_4.1
   1 #!genome-build-accession NCBI_Assembly:GCF_011125445.2
   1 #!gff-spec-version 1.21
   1 #!processor NCBI annotwriter
   1 ###
   1 ##gff-version 3
   1 ##sequence-region NC_012312.1 1 16526
   1 ##sequence-region NC_046361.1 1 44038137
   1 ##sequence-region NC_046362.1 1 39997022
   1 ##sequence-region NC_046363.1 1 47212184
   1 ##sequence-region NC_046364.1 1 41051309
   1 ##sequence-region NC_046365.1 1 44126044
   1 ##sequence-region NC_046366.1 1 34002518
   1 ##sequence-region NC_046367.1 1 44089915
   1 ##sequence-region NC_046368.1 1 35810078
   1 ##sequence-region NC_046369.1 1 40030575
   1 ##sequence-region NC_046370.1 1 33287791
   1 ##sequence-region NC_046371.1 1 39656907
   1 ##sequence-region NC_046372.1 1 49273114
   1 ##sequence-region NC_046373.1 1 42418174
   1 ##sequence-region NC_046374.1 1 35742472
   1 ##sequence-region NC_0

634024 CDS
  89 C_gene_segment
   1 D_loop
 164 V_gene_segment
18663 cDNA_match
695255 exon
34055 gene
   7 guide_RNA
5668 lnc_RNA
49534 mRNA
   1 origin_of_replication
 631 pseudogene
 119 rRNA
1031 region
 147 snRNA
 178 snoRNA
1908 tRNA
 961 transcript


The file includes CDS, exon, gene, lnc_RNA, mRNA, and transcript annotations. I'll pull the information from Gnomom, RefSeq, cmsearch, and tRNAscan-SE databases.

In [12]:
#Chromosome length information
!head mummichog.chrom.length

NC_046361.1	44038137
NC_046362.1	39997022
NC_046363.1	47212184
NC_046364.1	41051309
NC_046365.1	44126044
NC_046366.1	34002518
NC_046367.1	44089915
NC_046368.1	35810078
NC_046369.1	40030575
NC_046370.1	33287791


## 2. Generate genome feature tracks

I will extract CDS, exon, gene, lncRNA and mRNA tracks. I can then use those existing tracks to produce intron and  intergenic tracks, as well as 1 kb upstream and downstream flanking regions with `bedtools`. I will also use the RepeatMasker output from NCBI for my transposable element track.

In [13]:
!{bedtoolsDirectory}/bedtools --version

bedtools v2.31.1


### 2a. Gene

In [14]:
#Isolate gene entries from multiple annotation databses. Tab mus be included between database and feature
#Sort output for downstream use
#Include chromosome name information
!grep -e "Gnomon	gene" -e "RefSeq	gene" -e "cmsearch	gene" -e "tRNAscan-SE	gene" \
GCF_011125445.2_MU-UCD_Fhet_4.1_genomic.gff \
| {bedtoolsDirectory}/sortBed \
-faidx mummichog.chrom.length \
> GCF_011125445.2_MU-UCD_Fhet_4.1_genomic-gene.gff

In [15]:
!head GCF_011125445.2_MU-UCD_Fhet_4.1_genomic-gene.gff
!wc -l GCF_011125445.2_MU-UCD_Fhet_4.1_genomic-gene.gff

NC_046361.1	Gnomon	gene	65844	103004	.	+	.	ID=gene-LOC105920932;Dbxref=GeneID:105920932;Name=LOC105920932;gbkey=Gene;gene=LOC105920932;gene_biotype=protein_coding
NC_046361.1	Gnomon	gene	110815	113776	.	+	.	ID=gene-LOC118563811;Dbxref=GeneID:118563811;Name=LOC118563811;gbkey=Gene;gene=LOC118563811;gene_biotype=protein_coding
NC_046361.1	Gnomon	gene	122821	186471	.	+	.	ID=gene-stk38a;Dbxref=GeneID:105920930;Name=stk38a;gbkey=Gene;gene=stk38a;gene_biotype=protein_coding
NC_046361.1	Gnomon	gene	186912	191941	.	-	.	ID=gene-pex10;Dbxref=GeneID:105920931;Name=pex10;gbkey=Gene;gene=pex10;gene_biotype=protein_coding
NC_046361.1	Gnomon	gene	232913	240564	.	+	.	ID=gene-LOC110367680;Dbxref=GeneID:110367680;Name=LOC110367680;gbkey=Gene;gene=LOC110367680;gene_biotype=protein_coding
NC_046361.1	Gnomon	gene	275445	276112	.	+	.	ID=gene-LOC118564105;Dbxref=GeneID:118564105;Name=LOC118564105;gbkey=Gene;gene=LOC118564105;gene_biotype=lncRNA
NC_046361.1	Gnomon	gene	284255	629865	.	+	.	ID=gene-plch2a;Dbxre

### 2b. CDS

In [16]:
!grep -e "Gnomon	CDS" -e "RefSeq	CDS" -e "cmsearch	CDS" -e "tRNAscan-SE	CDS" \
GCF_011125445.2_MU-UCD_Fhet_4.1_genomic.gff \
| {bedtoolsDirectory}/sortBed \
-faidx mummichog.chrom.length \
> GCF_011125445.2_MU-UCD_Fhet_4.1_genomic-CDS.gff

In [17]:
!head GCF_011125445.2_MU-UCD_Fhet_4.1_genomic-CDS.gff
!wc -l GCF_011125445.2_MU-UCD_Fhet_4.1_genomic-CDS.gff

NC_046361.1	Gnomon	CDS	71404	71467	.	+	0	ID=cds-XP_035995402.1;Parent=rna-XM_036139509.1;Dbxref=GeneID:105920932,Genbank:XP_035995402.1;Name=XP_035995402.1;gbkey=CDS;gene=LOC105920932;product=DENN domain-containing protein 2D isoform X2;protein_id=XP_035995402.1
NC_046361.1	Gnomon	CDS	71404	71467	.	+	0	ID=cds-XP_035995482.1;Parent=rna-XM_036139589.1;Dbxref=GeneID:105920932,Genbank:XP_035995482.1;Name=XP_035995482.1;gbkey=CDS;gene=LOC105920932;product=DENN domain-containing protein 2D isoform X4;protein_id=XP_035995482.1
NC_046361.1	Gnomon	CDS	71404	71467	.	+	0	ID=cds-XP_035995438.1;Parent=rna-XM_036139545.1;Dbxref=GeneID:105920932,Genbank:XP_035995438.1;Name=XP_035995438.1;gbkey=CDS;gene=LOC105920932;product=DENN domain-containing protein 2D isoform X3;protein_id=XP_035995438.1
NC_046361.1	Gnomon	CDS	71404	71467	.	+	0	ID=cds-XP_035995353.1;Parent=rna-XM_036139460.1;Dbxref=GeneID:105920932,Genbank:XP_035995353.1;Name=XP_035995353.1;gbkey=CDS;gene=LOC105920932;product=DENN domain-contain

### 2c. Exon

In [18]:
!grep -e "Gnomon	exon" -e "RefSeq	exon" -e "cmsearch	exon" -e "tRNAscan-SE	exon" \
GCF_011125445.2_MU-UCD_Fhet_4.1_genomic.gff \
| {bedtoolsDirectory}/sortBed \
-faidx mummichog.chrom.length \
> GCF_011125445.2_MU-UCD_Fhet_4.1_genomic-exon.gff

In [19]:
!head GCF_011125445.2_MU-UCD_Fhet_4.1_genomic-exon.gff
!wc -l GCF_011125445.2_MU-UCD_Fhet_4.1_genomic-exon.gff

NC_046361.1	Gnomon	exon	65844	65948	.	+	.	ID=exon-XM_036139628.1-1;Parent=rna-XM_036139628.1;Dbxref=GeneID:105920932,Genbank:XM_036139628.1;gbkey=mRNA;gene=LOC105920932;product=DENN domain-containing protein 2D%2C transcript variant X5;transcript_id=XM_036139628.1
NC_046361.1	Gnomon	exon	68141	68185	.	+	.	ID=exon-XM_036139679.1-1;Parent=rna-XM_036139679.1;Dbxref=GeneID:105920932,Genbank:XM_036139679.1;gbkey=mRNA;gene=LOC105920932;product=DENN domain-containing protein 2D%2C transcript variant X6;transcript_id=XM_036139679.1
NC_046361.1	Gnomon	exon	68414	68480	.	+	.	ID=exon-XM_036139628.1-2;Parent=rna-XM_036139628.1;Dbxref=GeneID:105920932,Genbank:XM_036139628.1;gbkey=mRNA;gene=LOC105920932;product=DENN domain-containing protein 2D%2C transcript variant X5;transcript_id=XM_036139628.1
NC_046361.1	Gnomon	exon	68414	68480	.	+	.	ID=exon-XM_036139679.1-2;Parent=rna-XM_036139679.1;Dbxref=GeneID:105920932,Genbank:XM_036139679.1;gbkey=mRNA;gene=LOC105920932;product=DENN domain-containing prote

### 2d. lncRNA

In [20]:
!grep -e "Gnomon	lnc_RNA" -e "RefSeq	lnc_RNA" -e "cmsearch	lnc_RNA" -e "tRNAscan-SE	lnc_RNA" \
GCF_011125445.2_MU-UCD_Fhet_4.1_genomic.gff \
| {bedtoolsDirectory}/sortBed \
-faidx mummichog.chrom.length \
> GCF_011125445.2_MU-UCD_Fhet_4.1_genomic-lnc_RNA.gff

In [21]:
!head GCF_011125445.2_MU-UCD_Fhet_4.1_genomic-lnc_RNA.gff
!wc -l GCF_011125445.2_MU-UCD_Fhet_4.1_genomic-lnc_RNA.gff

NC_046361.1	Gnomon	lnc_RNA	275445	276112	.	+	.	ID=rna-XR_004931641.1;Parent=gene-LOC118564105;Dbxref=GeneID:118564105,Genbank:XR_004931641.1;Name=XR_004931641.1;gbkey=ncRNA;gene=LOC118564105;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 24 samples with support for all annotated introns;product=uncharacterized LOC118564105;transcript_id=XR_004931641.1
NC_046361.1	Gnomon	lnc_RNA	678041	688360	.	+	.	ID=rna-XR_004931662.1;Parent=gene-LOC118564132;Dbxref=GeneID:118564132,Genbank:XR_004931662.1;Name=XR_004931662.1;gbkey=ncRNA;gene=LOC118564132;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 8 samples with support for all annotated introns;product=uncharacterized LOC118564132;transcript_id=XR_004931662.1
NC_046361.1	Gnomon	lnc_RNA	725471	726046	.	+	.	ID=rna-XR_004931679.1;Parent=gene-LOC118564153;Dbxref=Ge

### 2e. mRNA

In [22]:
!grep -e "Gnomon	mRNA" -e "RefSeq	mRNA" -e "cmsearch	mRNA" -e "tRNAscan-SE	mRNA" \
GCF_011125445.2_MU-UCD_Fhet_4.1_genomic.gff \
| {bedtoolsDirectory}/sortBed \
-faidx mummichog.chrom.length \
> GCF_011125445.2_MU-UCD_Fhet_4.1_genomic-mRNA.gff

In [23]:
!head GCF_011125445.2_MU-UCD_Fhet_4.1_genomic-mRNA.gff
!wc -l GCF_011125445.2_MU-UCD_Fhet_4.1_genomic-mRNA.gff

NC_046361.1	Gnomon	mRNA	65844	103004	.	+	.	ID=rna-XM_036139628.1;Parent=gene-LOC105920932;Dbxref=GeneID:105920932,Genbank:XM_036139628.1;Name=XM_036139628.1;gbkey=mRNA;gene=LOC105920932;model_evidence=Supporting evidence includes similarity to: 2 ESTs%2C 5 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments;product=DENN domain-containing protein 2D%2C transcript variant X5;transcript_id=XM_036139628.1
NC_046361.1	Gnomon	mRNA	68141	103004	.	+	.	ID=rna-XM_036139679.1;Parent=gene-LOC105920932;Dbxref=GeneID:105920932,Genbank:XM_036139679.1;Name=XM_036139679.1;gbkey=mRNA;gene=LOC105920932;model_evidence=Supporting evidence includes similarity to: 2 ESTs%2C 5 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 3 samples with support for all annotated introns;product=DENN domain-containing protein 2D%2C transcript variant X6;transcript_id=XM_036139679.1
NC_046361.1	Gnomon	mRNA	71006	103004	.	+	.	ID=rna-XM_0361395

### 2f. Transcripts

In [10]:
!grep -e "Gnomon	transcript" -e "RefSeq	transcript" -e "cmsearch	transcript" -e "tRNAscan-SE	transcript" \
GCF_011125445.2_MU-UCD_Fhet_4.1_genomic.gff \
| {bedtoolsDirectory}/sortBed \
-faidx mummichog.chrom.length \
> Fundulus_heteroclitus-3.0.2.105-transcript.gff

In [11]:
!head Fundulus_heteroclitus-3.0.2.105-transcript.gff
!wc -l Fundulus_heteroclitus-3.0.2.105-transcript.gff

KN805525.1	ensembl	transcript	2500	10348	.	-	.	gene_id "ENSFHEG00000014345"; gene_version "1"; transcript_id "ENSFHET00000020248"; transcript_version "1"; gene_name "cdx1a"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "cdx1a-201"; transcript_source "ensembl"; transcript_biotype "protein_coding";
KN805525.1	ensembl	transcript	38959	57020	.	+	.	gene_id "ENSFHEG00000014326"; gene_version "1"; transcript_id "ENSFHET00000020197"; transcript_version "1"; gene_name "pdgfrb"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "pdgfrb-201"; transcript_source "ensembl"; transcript_biotype "protein_coding";
KN805525.1	ensembl	transcript	62194	78208	.	+	.	gene_id "ENSFHEG00000014282"; gene_version "1"; transcript_id "ENSFHET00000020149"; transcript_version "1"; gene_name "csf1ra"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "csf1ra-201"; transcript_source "ensembl"; transcript_biotype "protein_coding";
KN805525.1	ensembl	transc

### 2f. Non-coding sequences

In [7]:
#Find the complement to the exon track (non-coding sequences)
#Create a BEDfile of IGV
!{bedtoolsDirectory}/complementBed \
-i Fundulus_heteroclitus-3.0.2.105-exon.gff \
-g mummichog.chrom.length \
> Fundulus_heteroclitus-3.0.2.105-nonCDS.bed

In [8]:
!head Fundulus_heteroclitus-3.0.2.105-nonCDS.bed
!wc -l Fundulus_heteroclitus-3.0.2.105-nonCDS.bed

KN805525.1	0	2499
KN805525.1	2651	5165
KN805525.1	5274	8334
KN805525.1	8483	10200
KN805525.1	10348	38958
KN805525.1	38960	39612
KN805525.1	39736	39793
KN805525.1	39899	39983
KN805525.1	40241	40741
KN805525.1	40887	41873
  223862 Fundulus_heteroclitus-3.0.2.105-nonCDS.bed


### 2g. Intron

In [9]:
#Find the intersection between the non-coding sequences and genes (introns)
!{bedtoolsDirectory}/intersectBed \
-a Fundulus_heteroclitus-3.0.2.105-nonCDS.bed \
-b Fundulus_heteroclitus-3.0.2.105-gene.gff -sorted \
> Fundulus_heteroclitus-3.0.2.105-intron.bed

In [10]:
!head Fundulus_heteroclitus-3.0.2.105-intron.bed
!wc -l Fundulus_heteroclitus-3.0.2.105-intron.bed

KN805525.1	2651	5165
KN805525.1	5274	8334
KN805525.1	8483	10200
KN805525.1	38960	39612
KN805525.1	39736	39793
KN805525.1	39899	39983
KN805525.1	40241	40741
KN805525.1	40887	41873
KN805525.1	41997	43746
KN805525.1	43927	45300
  198417 Fundulus_heteroclitus-3.0.2.105-intron.bed


### 2j. Intergenic regions

In [11]:
#Find the complement of genes to obtain intergenic regions
!{bedtoolsDirectory}/complementBed \
-i Fundulus_heteroclitus-3.0.2.105-gene.gff -sorted \
-g mummichog.chrom.length \
> Fundulus_heteroclitus-3.0.2.105-intergenic.bed

In [12]:
!head Fundulus_heteroclitus-3.0.2.105-intergenic.bed
!wc -l Fundulus_heteroclitus-3.0.2.105-intergenic.bed

KN805525.1	0	2499
KN805525.1	10348	38958
KN805525.1	57020	62193
KN805525.1	78208	85894
KN805525.1	96387	96806
KN805525.1	106513	107709
KN805525.1	111142	130423
KN805525.1	137024	142523
KN805525.1	211966	214864
KN805525.1	229810	250423
   30648 Fundulus_heteroclitus-3.0.2.105-intergenic.bed
