# Xenopus Laevis Annotations and transcriptome generation for ChAR-seq

In this notebook, we prepare the "annotation files" required by tagtools to annotate the reads with transcript IDs and names. We also prepare a transcriptome from the GFF3 file and the genome FASTA.

The starting point is a GFF3 file for Xenopus laevis. GFF3 files for different species and from different databases are organized somewhat differently, so the steps below must be adapted to the specifics of each GFF file. Here, we handle the XENLA_9.2 annotation from Xenbase.


## XLaevis, GFF3 from Xenbase, database XENLA_9.2

### Download GFF3


```bash
# Let's say we downloaded the Xenla genome in $GENOMES_ROOT/xenopus_laevis/v9.2

GENOMES_ROOT="<root_folder_of_genomes>"
cd "${GENOMES_ROOT}/xenopus_laevis/v9.2"
mkdir annotations_xenbase
cd annotations_xenbase
curl -o XENLA_9.2_Xenbase.gff3  http://ftp.xenbase.org/pub/Genomics/JGI/Xenla9.2/XENLA_9.2_Xenbase.gff3

# OR via ftp
# wget ftp://ftp.xenbase.org/pub/Genomics/JGI/Xenla9.2/XENLA_9.2_Xenbase.gff3
```

### Make an annotation table

First we need to make a simple "annotation" table: a plain text file with one row per transcript that links the transcript ID to a gene ID, gene name, type, strand and length. tagtools uses this table to annotate the reads with the actual transcript ID and name. The file should look like this (taken from the human `genconde.v29.tableGENES.withStrand.txt`):

```txt
ENST00000000233.9       ENSG00000004059.10      ARF5    protein_coding  +       3360
ENST00000000412.7       ENSG00000003056.7       M6PR    protein_coding  -       9590
ENST00000000442.10      ENSG00000173153.13      ESRRA   protein_coding  +       11160
```

For Xenla, the desired features have the following labels in the third column:

- `lnc_RNA`
- `mRNA`
- `ncRNA`
- `rRNA`
- `snRNA`
- `snoRNA`
- `tRNA`
- `telomerase_RNA`
- `transcript`

as we can see from

```bash
cat XENLA_9.2_Xenbase.gff3 | grep -v "^#" | cut -f3 | sort | uniq
```

Therefore the proper command is the following (note that its `types` filter does not include `rRNA` and `tRNA`):

```bash
awk -F $'\t' 'BEGIN{types["mRNA"]=1; types["ncRNA"]=1; types["lnc_RNA"]=1; types["telomerase_RNA"]=1; types["snRNA"]=1; types["snoRNA"]=1; types["transcript"]=1; OFS="\t"}(/^#/){next;}($3 in types){delete x; delete y; delete z; split($9, x,";"); for (i = 0; ++i <= length(x);){split(x[i],y,"="); z[y[1]]=y[2];}; s=$5-$4; if (length(z["gene"])==0){z["gene"]=z["ID"];}; print z["ID"], z["Parent"], z["gene"], $3, $7, s}' XENLA_9.2_Xenbase.gff3 > XENLA_9.2_tableTX.txt
```

This file looks like
```text
rna87247	gene43722	LOC108705380	mRNA	-	2584
rna96750	gene50311	LOC108705381	mRNA	-	29344
rna31304	gene10772	rna31304	mRNA	+	8006
rna31305	gene10772	LOC108705386	mRNA	+	7148
rna78199	gene37470	LOC108705383	mRNA	+	70477
rna74114	gene34519	LOC108705385	lnc_RNA	+	6552
```
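For readers who prefer to stay in Python, the attribute parsing done by the awk command above can be sketched as follows. This is a minimal illustration, not part of tagtools: the names `parse_attributes` and `gff_line_to_row` and the example line are hypothetical, and the length is computed as `end - start` only to mirror `s=$5-$4` in the awk command.

```python
# Feature types kept by the awk command above.
TRANSCRIPT_TYPES = {"mRNA", "ncRNA", "lnc_RNA", "telomerase_RNA",
                    "snRNA", "snoRNA", "transcript"}


def parse_attributes(column9):
    """Turn the GFF3 attribute column ("key=value;key=value") into a dict."""
    attributes = {}
    for item in column9.strip().split(";"):
        key, sep, value = item.partition("=")
        if sep:  # skip empty or malformed items
            attributes[key] = value
    return attributes


def gff_line_to_row(line):
    """Return (ID, Parent, gene, type, strand, length), or None if the line is skipped."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9 or fields[2] not in TRANSCRIPT_TYPES:
        return None
    attributes = parse_attributes(fields[8])
    tx_id = attributes.get("ID", "")
    # Fall back to the ID when no gene name is present, as in the awk command.
    gene_name = attributes.get("gene") or tx_id
    # end - start, mirroring s=$5-$4.
    length = int(fields[4]) - int(fields[3])
    return (tx_id, attributes.get("Parent", ""), gene_name,
            fields[2], fields[6], length)


# Hypothetical GFF3 line, made up for illustration.
example = "chr1L\tXenbase\tmRNA\t1000\t3584\t.\t-\t.\tID=rna1;Parent=gene1;gene=abc1"
print(gff_line_to_row(example))
```

Applying `gff_line_to_row` to every line of the GFF3 file and keeping the non-`None` results should give rows with the same six columns as the table above.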

### Make the exons dictionary
To convert transcriptome coordinates to genomic coordinates, tagtools requires a Python dictionary containing the location of the splice junctions for each transcript. For now, we do this manually by adjusting the function below to the features of the specific GFF file we are working with. The function creates a Python dictionary, which we save as a pickle file (this will be changed to a JSON file in a future release).

In [11]:
import subprocess
import numpy as np
import os
def make_exons_dict_fromGFF(gff_file, nmax=0):
    m={}
    
    cmd = "cat "+gff_file+""" | awk -F $'\t' 'BEGIN{types["mRNA"]=1; types["ncRNA"]=1; types["lnc_RNA"]=1; types["telomerase_RNA"]=1; types["snRNA"]=1; types["snoRNA"]=1; types["transcript"]=1; types["gene"]=1; types["tRNA"]=1; types["rRNA"]=1; types["C_gene_segment"]=1; types["V_gene_segment"]=1; OFS="\t"}(/^#/){next;}($3 in types)"""
    cmd+="""{delete xarr; delete yy; delete zz; x=$9; split(x,xarr,\";\"); for (i=1; i<=length(xarr); i++)"""
    cmd+=""" {split(xarr[i],yy,"="); zz[yy[1]]=yy[2];}; print zz["ID"], $4, $5, $7, $1}'"""
    
    print(cmd)
    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, bufsize=1, universal_newlines=True, start_new_session=True)
    nok=0
    for _, line in enumerate(p.stdout):
        if nmax>0 and nok>nmax:
            break
        read_data=line.strip().split("\t") 
        T_id=read_data[0]  # transcript ID
        
        if T_id in m:
            print("oops, already found")
#             m[T_id]+=[read_data[3],read_data[1],read_data[2]]
        else:
            nok+=1
            readstrand=int(1) if read_data[3]=="+" else int(-1)
            m[T_id]=[readstrand,int(read_data[1]),int(read_data[2]),read_data[4]]
            
    p.terminate()
    del p
    
    m2={}
    cmd = "cat "+gff_file+""" | awk -F $'\t' 'BEGIN{OFS="\t"}((!/^#/) && $3=="exon")"""
    cmd+="""{delete xarr; delete yy; delete zz; x=$9; split(x,xarr,\";\"); for (i=1; i<=length(xarr); i++)"""
    cmd+=""" {split(xarr[i],yy,"="); zz[yy[1]]=yy[2];}; print zz["Parent"], $4, $5}'"""
    print(cmd)
    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, bufsize=1, universal_newlines=True, start_new_session=True)
    nok=0
    for _, line in enumerate(p.stdout):
        if nmax>0 and nok>nmax:
            break
        read_data=line.strip().split("\t") 
        T_id=read_data[0]  # ID of the exon's parent (transcript)
        
        vals=m2.pop(T_id,[])
        if len(vals)>0:
            
            m2[T_id]=np.vstack([vals,np.array([int(read_data[1]),int(read_data[2])])])
        else:
            nok+=1
            m2[T_id]=np.array([[int(read_data[1]),int(read_data[2])]])
    
    p.terminate()
    del p
    
    for k, v in m.items():
        vals=m2.pop(k,[])
        if len(vals)>0:
            if v[0]==1:
                a=np.argsort(vals[:,0])
                m2[k]=[v[3],np.insert(np.cumsum(vals[a,1]-vals[a,0]+1),0,0),vals[a,0]]
            else:
                a=np.argsort(-vals[:,1])
                m2[k]=[v[3],np.insert(np.cumsum(vals[a,1]-vals[a,0]+1),0,0),-vals[a,1]]
        else:
            m2[k]=[]
    
    m3={k:v for k, v in m2.items() if len(v)>0}
    
    
    return m3

In [12]:
xenla_exons=make_exons_dict_fromGFF("XENLA_9.2_Xenbase.gff3")

cat XENLA_9.2_Xenbase.gff3 | awk -F $'	' 'BEGIN{types["mRNA"]=1; types["ncRNA"]=1; types["lnc_RNA"]=1; types["telomerase_RNA"]=1; types["snRNA"]=1; types["snoRNA"]=1; types["transcript"]=1; types["gene"]=1; types["tRNA"]=1; types["rRNA"]=1; types["C_gene_segment"]=1; types["V_gene_segment"]=1; OFS="	"}(/^#/){next;}($3 in types){delete xarr; delete yy; delete zz; x=$9; split(x,xarr,";"); for (i=1; i<=length(xarr); i++) {split(xarr[i],yy,"="); zz[yy[1]]=yy[2];}; print zz["ID"], $4, $5, $7, $1}'
cat XENLA_9.2_Xenbase.gff3 | awk -F $'	' 'BEGIN{OFS="	"}((!/^#/) && $3=="exon"){delete xarr; delete yy; delete zz; x=$9; split(x,xarr,";"); for (i=1; i<=length(xarr); i++) {split(xarr[i],yy,"="); zz[yy[1]]=yy[2];}; print zz["Parent"], $4, $5}'


Verify that the exons dictionary looks fine

In [14]:
{k:v for i, (k,v) in enumerate(xenla_exons.items()) if ((i>-1) & (i<150))}

{'gene34525': ['MT', array([ 0, 69]), array([9038])],
 'gene34778': ['MT', array([  0, 819]), array([2205])],
 'gene36167': ['MT', array([ 0, 71]), array([-7226])],
 'gene37689': ['MT', array([ 0, 65]), array([13716])],
 'gene38299': ['Scaffold100', array([  0, 943]), array([35531])],
 'gene40118': ['MT', array([ 0, 69]), array([-17553])],
 'gene40294': ['MT', array([ 0, 69]), array([7015])],
 'gene40993': ['MT', array([ 0, 69]), array([5910])],
 'gene41118': ['MT', array([ 0, 69]), array([11905])],
 'gene42065': ['MT', array([ 0, 69]), array([2136])],
 'gene42179': ['MT', array([ 0, 70]), array([-7394])],
 'gene43253': ['MT', array([ 0, 75]), array([4724])],
 'gene43286': ['MT', array([ 0, 70]), array([11492])],
 'gene44223': ['MT', array([ 0, 74]), array([13781])],
 'gene44770': ['MT', array([   0, 1631]), array([3093])],
 'gene44846': ['MT', array([ 0, 75]), array([9797])],
 'gene46399': ['MT', array([ 0, 69]), array([-7154])],
 'gene47030': ['MT', array([ 0, 71]), array([-5910])],
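To make the layout of each entry concrete: judging from the function above, an entry is `[chromosome, cumulative exon lengths, exon starts]`, with exons listed in transcript order and minus-strand starts stored as the negated exon end. Under that reading, a 0-based transcript offset could be mapped back to the genome roughly as below. This is a sketch of the idea only: `transcript_to_genome` is a hypothetical helper, the two example entries are made up, and how tagtools itself consumes the dictionary is not shown in this notebook.

```python
from bisect import bisect_right


def transcript_to_genome(entry, offset):
    """Map a 0-based transcript offset to (chromosome, genomic position, strand).

    `entry` is [chromosome, cumulative_lengths, starts]; the last two may be
    lists or NumPy arrays.
    """
    chrom, cumulative, starts = entry
    if not 0 <= offset < cumulative[-1]:
        raise ValueError("offset lies outside the transcript")
    # Index of the exon containing the offset.
    exon = bisect_right(cumulative, offset) - 1
    within_exon = offset - cumulative[exon]
    # Plus strand: start + distance. Minus strand: -(end) + distance, whose
    # absolute value is end - distance, i.e. walking down the genome.
    position = int(abs(starts[exon] + within_exon))
    strand = "+" if starts[0] >= 0 else "-"
    return chrom, position, strand


# Hypothetical entries: exons 100-109 and 200-214 on each strand.
plus_entry = ["chr1", [0, 10, 25], [100, 200]]
minus_entry = ["chr1", [0, 15, 25], [-214, -109]]

print(transcript_to_genome(plus_entry, 10))
print(transcript_to_genome(minus_entry, 15))
```

Storing minus-strand coordinates as negative numbers lets the same addition serve both strands, which may be the motivation for the sign convention in the dictionary.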


Now we can save this dictionary

In [15]:
import pickle
with open("XENLA_9.2_annotation.gff3.pickle", 'wb') as handle:
    pickle.dump(xenla_exons, handle, protocol=pickle.HIGHEST_PROTOCOL)
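Since the pickle is meant to be replaced by JSON eventually, here is a small sketch of what reloading the file and converting it to a JSON-friendly structure might look like. It is an assumption about one possible layout, not the format tagtools will use: `to_jsonable` is a hypothetical helper, and the one-entry `example` dictionary stands in for `xenla_exons`.

```python
import json
import os
import pickle
import tempfile


def to_jsonable(exons):
    """Convert {id: [chrom, cumulative_lengths, starts]} to plain lists of ints."""
    converted = {}
    for tx_id, (chrom, cumulative, starts) in exons.items():
        # int() also handles NumPy integers, which json cannot serialize directly.
        converted[tx_id] = [chrom,
                            [int(c) for c in cumulative],
                            [int(s) for s in starts]]
    return converted


# Hypothetical entry standing in for the real dictionary.
example = {"rnaX": ["chr1", [0, 10, 25], [100, 200]]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pickle")
    with open(path, "wb") as handle:
        pickle.dump(example, handle, protocol=pickle.HIGHEST_PROTOCOL)
    # Reload to confirm the file round-trips.
    with open(path, "rb") as handle:
        reloaded = pickle.load(handle)

json_text = json.dumps(to_jsonable(reloaded))
print(json_text)
```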

### Make an annotation table and exons dictionary for the gene bodies

We adjust the annotation table generation to get gene bodies instead of transcripts. The gene bodies are the lines whose third field is `gene`, as shown by

```bash
grep "gbkey=Gene" XENLA_9.2_Xenbase.gff3 | cut -f3 | sort | uniq
```

so we need

```bash
awk -F $'\t' 'BEGIN{OFS="\t"}(/^#/){next;}($3=="gene"){delete x; delete y; delete z; split($9, x,";"); for (i = 0; ++i <= length(x);){split(x[i],y,"="); z[y[1]]=y[2];}; if (length(z["gene"])==0){z["gene"]=z["ID"];}; s=$5-$4; print z["ID"], z["ID"], z["gene"], z["gene_biotype"], $7, s}' XENLA_9.2_Xenbase.gff3 > XENLA_9.2_tableGENEBODIES.txt
```

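Verify the file's integrity with the command below, which prints every row that has fewer than six fields or an empty field; no output means the table is complete.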
```bash
awk -F $'\t' '(length($1)==0 || length($2)==0 || length($3)==0 || length($4)==0 || length($5)==0 || length($6)==0 || NF<6)' XENLA_9.2_tableGENEBODIES.txt
```
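The same check can be written in Python, which is convenient when the table is already being handled in the notebook. This is a minimal sketch: `incomplete_rows` is a hypothetical helper and the two example rows are made up.

```python
def incomplete_rows(lines, n_fields=6):
    """Return (line number, line) for rows with a missing or empty field."""
    bad = []
    for line_number, line in enumerate(lines, start=1):
        row = line.rstrip("\n")
        fields = row.split("\t")
        # Same conditions as the awk check: too few fields, or an empty one.
        if len(fields) < n_fields or any(f == "" for f in fields[:n_fields]):
            bad.append((line_number, row))
    return bad


# Hypothetical rows: the second one has an empty biotype column.
example_rows = [
    "gene1\tgene1\tabc1\tprotein_coding\t+\t2584\n",
    "gene2\tgene2\tgene2\t\t-\t100\n",
]
print(incomplete_rows(example_rows))
```

To check the real table, pass an open file handle instead of the example list, for instance `incomplete_rows(open("XENLA_9.2_tableGENEBODIES.txt"))`.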

And now the gene-body dictionary

In [8]:
def make_genebodies_dict_fromGFF(gff_file, nmax=0):
    m={}
    
    cmd = "cat "+gff_file+""" | awk -F $'\t' 'BEGIN{OFS="\t"}(/^#/){next;}($3=="gene")"""
    cmd+="""{delete xarr; delete yy; delete zz; x=$9; split(x,xarr,\";\"); for (i=1; i<=length(xarr); i++)"""
    cmd+=""" {split(xarr[i],yy,"="); zz[yy[1]]=yy[2];}; print zz["ID"], $4, $5, $7, $1}'"""
    
    print(cmd)
    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, bufsize=1, universal_newlines=True, start_new_session=True)
    nok=0
    for _, line in enumerate(p.stdout):
        if nmax>0 and nok>nmax:
            break
        read_data=line.strip().split("\t") 
        T_id=read_data[0]  # gene ID
        
        if T_id in m:
            print("oops, already found")
#             m[T_id]+=[read_data[3],read_data[1],read_data[2]]
        else:
            nok+=1
            readstrand=int(1) if read_data[3]=="+" else int(-1)
            m[T_id]=[readstrand,int(read_data[1]),int(read_data[2]),read_data[4]]
            
    p.terminate()
    del p
    
    m2={}
    cmd = "cat "+gff_file+""" | awk -F $'\t' 'BEGIN{OFS="\t"}(/^#/){next;}($3=="gene")"""
    cmd+="""{delete xarr; delete yy; delete zz; x=$9; split(x,xarr,\";\"); for (i=1; i<=length(xarr); i++)"""
    cmd+=""" {split(xarr[i],yy,"="); zz[yy[1]]=yy[2];}; print zz["ID"], $4, $5}'"""
    print(cmd)
    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, bufsize=1, universal_newlines=True, start_new_session=True)
    nok=0
    for _, line in enumerate(p.stdout):
        if nmax>0 and nok>nmax:
            break
        read_data=line.strip().split("\t") 
        T_id=read_data[0]  # gene ID
        
        vals=m2.pop(T_id,[])
        if len(vals)>0:
            
            m2[T_id]=np.vstack([vals,np.array([int(read_data[1]),int(read_data[2])])])
        else:
            nok+=1
            m2[T_id]=np.array([[int(read_data[1]),int(read_data[2])]])
    
    p.terminate()
    del p
    
    for k, v in m.items():
        vals=m2.pop(k,[])
        if len(vals)>0:
            if v[0]==1:
                a=np.argsort(vals[:,0])
                m2[k]=[v[3],np.insert(np.cumsum(vals[a,1]-vals[a,0]+1),0,0),vals[a,0]]
            else:
                a=np.argsort(-vals[:,1])
                m2[k]=[v[3],np.insert(np.cumsum(vals[a,1]-vals[a,0]+1),0,0),-vals[a,1]]
        else:
            m2[k]=[]
    
    return m2

In [None]:
xenla_genebodies=make_genebodies_dict_fromGFF("XENLA_9.2_Xenbase.gff3")

In [10]:
{k:v for i, (k,v) in enumerate(xenla_genebodies.items()) if i<5}

{'gene11732': ['chr9_10L', array([    0, 20135]), array([17229350])],
 'gene17849': ['chr6S', array([    0, 17355]), array([-8437766])],
 'gene25214': ['chr9_10L', array([    0, 16132]), array([97703976])],
 'gene29874': ['chr2S', array([   0, 2250]), array([62437567])],
 'gene45651': ['chr7L', array([   0, 4450]), array([-78606500])]}

Everything looks fine so we can save it

In [12]:
with open("XENLA_9.2_annotationGENEBODIES.gff3.pickle", 'wb') as handle:
    pickle.dump(xenla_genebodies, handle, protocol=pickle.HIGHEST_PROTOCOL)

## Make fasta

The steps below generate a transcriptome FASTA for Salmon. For human, we can just download the transcriptome. For Xenla, we couldn't find the transcriptome online, so we need to generate it from the GFF file, which is a pain. First, convert the GFF to a BED file with the transcript definitions:

```bash
# BED starts are 0-based, so the 1-based GFF start ($4) is shifted down by one
awk -F $'\t' 'BEGIN{types["mRNA"]=1; types["ncRNA"]=1; types["lnc_RNA"]=1; types["telomerase_RNA"]=1; types["snRNA"]=1; types["snoRNA"]=1; types["transcript"]=1; OFS="\t"}(/^#/){next;}($3 in types){delete x; delete y; delete z; split($9, x,";"); for (i = 0; ++i <= length(x);){split(x[i],y,"="); z[y[1]]=y[2];}; if (length(z["gene"])==0){z["gene"]=z["ID"];}; print $1, $4-1, $5, z["ID"]" gene="z["gene"]" biotype="$3, ".", $7}' XENLA_9.2_Xenbase.gff3 > XENLA_9.2_TX.bed
```

Now make the FASTA file:

```bash
bedtools getfasta -name -fi ../XL9_2.fa -bed XENLA_9.2_TX.bed >XENLA_9.2_transcriptome.fa
```

The problem with the above method is that the resulting FASTA actually contains the introns. We need a FASTA file without introns, which is what Salmon wants!

```bash
cat XENLA_9.2_Xenbase.gff3 | awk -F $'\t' 'BEGIN{OFS=""; ORS=""; curr_rna=""}(/^#/){next;}($3=="exon"){delete x; delete y; delete z; split($9, x,";"); for (i = 0; ++i <= length(x);){split(x[i],y,"="); z[y[1]]=y[2];}; parent=z["Parent"]; if (substr(parent,1,3)=="rna"){if (parent==curr_rna){range=$1":"$4"-"$5; system("samtools faidx ../XL9_2.fa "range" | awk -f awkbody.awk");} else {print "\n>"parent"\n"; range=$1":"$4"-"$5; system("samtools faidx ../XL9_2.fa "range" | awk -f awkbody.awk"); curr_rna=parent};};}' | sed -e "1d" >XENLA_9.2_transcriptome_NOINTRONS.fa
```

The command above is very slow because it makes one system call to samtools per exon. Let's speed it up by combining the exonic ranges of a single transcript into a single system call:

```bash
cat XENLA_9.2_Xenbase.gff3 | awk -F $'\t' 'BEGIN{OFS=""; ORS=""; curr_rna="";}(/^#/){next;}($3=="exon"){delete x; delete y; delete z; split($9, x,";"); for (i = 0; ++i <= length(x);){split(x[i],y,"="); z[y[1]]=y[2];}; parent=z["Parent"]; if (substr(parent,1,3)=="rna"){if (parent==curr_rna){range=range" "$1":"$4"-"$5;} else {if (length(curr_rna)>0){print ">"curr_rna"\n"; system("samtools faidx ../XL9_2.fa "range" | awk -f awkbody_fastcompute.awk"); print "\n"}; curr_rna=parent; range=$1":"$4"-"$5;}}}END{print ">"curr_rna"\n"; system("samtools faidx ../XL9_2.fa "range" | awk -f awkbody_fastcompute.awk"); print "\n"}' >XENLA_9.2_transcriptome_NOINTRONS_fastcompute.fa
```
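The idea behind the intron-free FASTA (group exons by their parent, then concatenate the exon sequences) can be sketched in pure Python. This is an illustration under stated assumptions, not a replacement for the awk pipeline: `spliced_transcripts` is a hypothetical helper, the genome is a small in-memory dict rather than an indexed FASTA, coordinates are 1-based and inclusive as in GFF3, and minus-strand transcripts are reverse-complemented, which the pipeline above may or may not do since `awkbody_fastcompute.awk` is not shown here.

```python
from collections import defaultdict

# Translation table used for reverse-complementing minus-strand transcripts.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def spliced_transcripts(exons, genome):
    """Build {parent: spliced sequence} from exon records.

    `exons` yields (parent, chrom, start, end, strand) tuples with 1-based,
    inclusive coordinates; `genome` maps chromosome names to sequences.
    """
    by_parent = defaultdict(list)
    for parent, chrom, start, end, strand in exons:
        by_parent[parent].append((chrom, start, end, strand))

    sequences = {}
    for parent, parts in by_parent.items():
        parts.sort(key=lambda part: part[1])  # ascending genomic order
        seq = "".join(genome[chrom][start - 1:end]
                      for chrom, start, end, _ in parts)
        if parts[0][3] == "-":
            seq = seq.translate(COMPLEMENT)[::-1]
        sequences[parent] = seq
    return sequences


# Hypothetical toy genome and exons.
toy_genome = {"chr1": "ACGTACGTAAGG"}
toy_exons = [
    ("rna1", "chr1", 1, 4, "+"),
    ("rna1", "chr1", 9, 10, "+"),
    ("rna2", "chr1", 11, 12, "-"),
    ("rna2", "chr1", 2, 3, "-"),
]
transcripts = spliced_transcripts(toy_exons, toy_genome)
for name, seq in transcripts.items():
    print(">" + name)
    print(seq)
```

Grouping by parent in a dictionary also means the exons of a transcript do not need to be contiguous in the input, unlike the streaming awk approach, at the cost of holding all exon records in memory.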

OK, this FASTA doesn't contain solo exons, that is, exons whose parent is a gene rather than a transcript. We make FASTA files for the solo exons by picking the ones whose parent starts with "gene", as well as the ones whose parent starts with "id":

```bash
cat XENLA_9.2_Xenbase.gff3 | awk -F $'\t' 'BEGIN{OFS=""; ORS=""; curr_rna="";}(/^#/){next;}($3=="exon"){delete x; delete y; delete z; split($9, x,";"); for (i = 0; ++i <= length(x);){split(x[i],y,"="); z[y[1]]=y[2];}; parent=z["Parent"]; if (substr(parent,1,4)=="gene"){if (parent==curr_rna){range=range" "$1":"$4"-"$5;} else {if (length(curr_rna)>0){print ">"curr_rna"\n"; system("samtools faidx ../XL9_2.fa "range" | awk -f awkbody_fastcompute.awk"); print "\n"}; curr_rna=parent; range=$1":"$4"-"$5;}}}END{print ">"curr_rna"\n"; system("samtools faidx ../XL9_2.fa "range" | awk -f awkbody_fastcompute.awk"); print "\n"}' >XENLA_9.2_transcriptome_NOINTRONS_fastcompute_soloExons.fa

cat XENLA_9.2_Xenbase.gff3 | awk -F $'\t' 'BEGIN{OFS=""; ORS=""; curr_rna="";}(/^#/){next;}($3=="exon"){delete x; delete y; delete z; split($9, x,";"); for (i = 0; ++i <= length(x);){split(x[i],y,"="); z[y[1]]=y[2];}; parent=z["Parent"]; if (substr(parent,1,2)=="id"){if (parent==curr_rna){range=range" "$1":"$4"-"$5;} else {if (length(curr_rna)>0){print ">"curr_rna"\n"; system("samtools faidx ../XL9_2.fa "range" | awk -f awkbody_fastcompute.awk"); print "\n"}; curr_rna=parent; range=$1":"$4"-"$5;}}}END{print ">"curr_rna"\n"; system("samtools faidx ../XL9_2.fa "range" | awk -f awkbody_fastcompute.awk"); print "\n"}' >XENLA_9.2_transcriptome_NOINTRONS_fastcompute_idTypes.fa
```

Finally, combine all these FASTA files into the final transcriptome:

```bash
cat XENLA_9.2_transcriptome_NOINTRONS_fastcompute.fa XENLA_9.2_transcriptome_NOINTRONS_fastcompute_soloExons.fa XENLA_9.2_transcriptome_NOINTRONS_fastcompute_idTypes.fa >XENLA_9.2_transcriptome_NOINTRONS_fastcompute_ALL.fa
```
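Because the awk commands above start a new record whenever the parent changes, a parent whose exons are not contiguous in the GFF would appear more than once in the output. Duplicate record names are worth ruling out before building an index. The sketch below shows one way to check; `duplicated_fasta_ids` is a hypothetical helper and the example lines are made up.

```python
from collections import Counter


def duplicated_fasta_ids(lines):
    """Return the sorted record names that occur more than once in a FASTA."""
    counts = Counter(
        line[1:].split()[0]          # name = first word after ">"
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    )
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical FASTA content with one repeated name.
example_fasta = [
    ">rna1\n", "ACGT\n",
    ">rna2\n", "GGCC\n",
    ">rna1\n", "TTAA\n",
]
print(duplicated_fasta_ids(example_fasta))
```

For the real file, pass an open handle, for instance `duplicated_fasta_ids(open("XENLA_9.2_transcriptome_NOINTRONS_fastcompute_ALL.fa"))`; an empty list means every record name is unique.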

## Make all the other tables

- annotations: chrdb -> copy `chrNameLength.txt` from the STAR index folder
- annotations: txdb -> `awk -F $'\t' 'BEGIN{types["mRNA"]=1; types["ncRNA"]=1; types["lnc_RNA"]=1; types["telomerase_RNA"]=1; types["snRNA"]=1; types["snoRNA"]=1; types["transcript"]=1; OFS="\t"}(/^#/){next;}($3 in types){delete x; delete y; delete z; split($9, x,";"); for (i = 0; ++i <= length(x);){split(x[i],y,"="); z[y[1]]=y[2];}; s=$5-$4; print z["ID"], z["Parent"], $7, $4, s}' XENLA_9.2_Xenbase.gff3 >XENLA_9.2_TXDB.txt`
- annotations: genedb -> `awk -F $'\t' 'BEGIN{OFS="\t"}(/^#/){next;}($3=="gene"){delete x; delete y; delete z; split($9, x,";"); for (i = 0; ++i <= length(x);){split(x[i],y,"="); z[y[1]]=y[2];}; if (length(z["gene"])==0){z["gene"]=z["ID"];}; s=$5-$4; print z["ID"], z["gene_biotype"], z["gene"], $1, $7, $4}' XENLA_9.2_Xenbase.gff3 >XENLA_9.2_GENESDB.txt`

# Generate Bowtie, STAR and Salmon indices
See `02_make_aligner_indexes_XenopusLaevis.ipynb`.