# Get human 3'UTR coordinates

The 3' UTR coordinates are taken from the UCSC Table Browser (hgTables).

To obtain the 3' UTR positions of protein-coding genes only, follow the steps below:

1. Navigate to the Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables)
2. Select the hg38 assembly and the Genes and Gene Predictions group
3. Select the GENCODE Genes V20 track and choose the Basic table (should be the default)
4. To restrict the output to a particular region, select it in the "define regions" box;
otherwise choose "genome"
5. Click the "create" button next to "filter"
6. Allow filtering from the linked table wgEncodeGencodeAttrsV20
7. In the transcriptClass field under the "hg38.wgEncodeGencodeAttrsV20 based filters" section,
enter "coding" into the text box, so "transcriptClass does match coding", then click "submit"
8. Under output format choose "BED - browser extensible data", enter a name for your file, and
click "get output"
9. On the "Output wgEncodeGencodeBasicV20 as BED" page, choose "3' UTR Exons", and click "get BED"
to download your file
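
Before post-processing, it can help to check that the downloaded file has the expected six BED columns and that a transcript ID can be recovered from the name field. The sketch below uses made-up rows: the transcript IDs, coordinates, and the exact `_utr3_...` name suffix are assumptions for illustration, so compare them against your own download.

```python
import io
import pandas as pd

# Hypothetical rows imitating the "3' UTR Exons" BED output (not real data).
toy_bed = (
    "chr1\t1000\t1500\tENST00000000001.2_utr3_0_0_chr1_1001_f\t0\t+\n"
    "chr2\t7000\t7250\tENST00000000002.5_utr3_3_0_chr2_7001_r\t0\t-\n"
)

cols = ['chrom', 'start', 'stop', 'utr_name', 'score', 'strand']
toy_df = pd.read_csv(io.StringIO(toy_bed), sep='\t', names=cols)

# Everything before the first '.' is the unversioned Ensembl transcript ID.
toy_df['transcript_id'] = toy_df.utr_name.str.split('.').str[0]

# BED intervals are 0-based and half-open, so length is simply stop - start.
toy_df['length'] = toy_df.stop - toy_df.start

print(toy_df[['chrom', 'transcript_id', 'length', 'strand']])
```

The same two derived columns can be computed on the real file once it is loaded with `pd.read_csv`.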

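The downloaded file is then post-processed:

* add the HGNC symbol (table from BioMart)
* remove irrelevant contigs (e.g. decoy sequences)
* keep only contiguous UTR regions, i.e. transcripts whose 3' UTR lies in a single exon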

In [23]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

mpl.rcParams.update({'xtick.labelsize': 16, 'ytick.labelsize': 16, 
                     'axes.titlesize':16, 'axes.labelsize':18})

In [24]:
data_dir = '/lustre/groups/epigenereg01/workspace/projects/vale/mlm/'

In [25]:
#3' UTR exon coordinates from hgTables (6-column BED)
all_utr_df = pd.read_csv(data_dir + 'UTR_coords/GRCh38_3_prime_UTR.bed.gz', sep='\t',
                         names=['chrom','start','stop','utr_name','score','strand'])

#mapping between Ensembl transcript IDs and HGNC gene symbols
gene_annot_df = pd.read_csv(data_dir + 'UTR_coords/GRCh38_EnsembleCanonical_HGNC.tsv.gz', sep='\t', skiprows=1, header=None,
                            names=['transcript_id','canonical','HGNC_symbol'], usecols=[1,2,3])

#unversioned transcript ID = part of the UTR name before the first '.'
all_utr_df['transcript_id'] = all_utr_df.utr_name.apply(lambda x: x.split('.')[0])

#inner join on the only shared column
df = all_utr_df.merge(gene_annot_df, on='transcript_id')

#keep only Ensembl canonical transcripts that have an HGNC symbol
df = df[(df.canonical==1) & (~df.HGNC_symbol.isna())]

#exclude decoy sequences, unplaced scaffolds, etc. (their names contain '_')
df = df[~df.chrom.str.contains('_')]

#keep only single-exon UTRs: keep=False drops every transcript that occurs more than once
df.drop_duplicates(subset=['transcript_id'], keep=False, inplace=True)
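
The last filter relies on `keep=False`, which discards all rows of a duplicated key instead of keeping one representative. A minimal sketch on a toy table (the transcript IDs and coordinates are invented for illustration) shows the effect:

```python
import pandas as pd

# Toy table: ENST_B has a 3' UTR split over two exons, the others have one.
toy = pd.DataFrame({
    'transcript_id': ['ENST_A', 'ENST_B', 'ENST_B', 'ENST_C'],
    'start': [100, 500, 900, 50],
    'stop':  [200, 600, 950, 80],
})

# keep=False removes every row whose transcript_id appears more than once,
# so multi-exon UTRs disappear entirely rather than being truncated.
single_exon = toy.drop_duplicates(subset=['transcript_id'], keep=False)
kept_ids = single_exon.transcript_id.tolist()

print(kept_ids)
```

With the default `keep='first'`, the first exon of `ENST_B` would remain and look like a complete UTR, which is why `keep=False` is the appropriate choice here.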

In [26]:
len(df)

18178

In [70]:
df.to_csv(data_dir + 'UTR_coords/GRCh38_3_prime_UTR_clean.bed', sep='\t', index=False, header=False)

In [36]:
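#chromosomes must be in natural order: chr1, chr2, ..., chr22, chrX, chrY
#.loc keeps only the listed chromosomes and raises KeyError if any of them is absent from df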
chroms = [f'chr{x}' for x in range(1,23)] + ['chrX','chrY']
df = df.set_index('chrom').loc[chroms].reset_index()
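
The `.loc`-based reordering above keeps the original row order within each chromosome and fails if a listed chromosome is missing. An alternative, sketched here on toy data and not the method used in this notebook, is an ordered categorical, which tolerates missing chromosomes and also allows sorting by start position:

```python
import pandas as pd

chrom_order = [f'chr{x}' for x in range(1, 23)] + ['chrX', 'chrY']

# Toy coordinates (invented); chrY and most autosomes are deliberately absent.
toy = pd.DataFrame({
    'chrom': ['chr10', 'chr2', 'chrX', 'chr2', 'chr1'],
    'start': [5, 300, 7, 100, 42],
})

# An ordered categorical makes sort_values follow chrom_order instead of
# lexicographic order (where 'chr10' would sort before 'chr2').
# Note: chromosomes not listed in chrom_order would become NaN.
toy['chrom'] = pd.Categorical(toy['chrom'], categories=chrom_order, ordered=True)
sorted_toy = toy.sort_values(['chrom', 'start']).reset_index(drop=True)

# Convert back to plain strings before writing the table to disk.
sorted_toy['chrom'] = sorted_toy['chrom'].astype(str)

print(sorted_toy)
```

Whether the within-chromosome sort is needed depends on the downstream tool that consumes the BED file.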

In [None]:
df.to_csv(data_dir + 'UTR_coords/GRCh38_3_prime_UTR_clean-sorted.bed', sep='\t', index=False, header=False)