Find corresponding human 3'UTR regions for each Oligo variant from Griesemer et al. paper:

https://www.sciencedirect.com/science/article/pii/S0092867421009995

* consider only SNPs
* Griesemer et al. apparently did not restrict their variants to protein-coding genes, so we lose some variants

In [1]:
import pandas as pd
import numpy as np

In [2]:
#clean human 3'UTR, see GRCh38_3_prime_UTR_clean.ipynb

human_utr_df = pd.read_csv('/s/project/mll/sergey/MLM/UTR_coords/GRCh38_3_prime_UTR_clean.bed', sep='\t', 
                       names = ['chrom','human_UTR_start','human_UTR_end','UTR_ID',
                               'score','strand','transcript_ID','canonical','HGNC_Symbol','UTR_len'])

In [3]:
# IMPORTANT: np.searchsorted below assumes UTR starts are sorted within each chromosome
human_utr_df.sort_values(by=['chrom','human_UTR_start'], inplace=True)

human_utr_df.head()

Unnamed: 0,chrom,human_UTR_start,human_UTR_end,UTR_ID,score,strand,transcript_ID,canonical,HGNC_Symbol,UTR_len
564,chr1,70008,71585,ENST00000641515.2_utr3_2_0_chr1_70009_f,0,+,ENST00000641515,1.0,OR4F5,1577
565,chr1,944153,944574,ENST00000616016.5_utr3_13_0_chr1_944154_f,0,+,ENST00000616016,1.0,SAMD11,421
566,chr1,944202,944693,ENST00000327044.7_utr3_18_0_chr1_944203_r,0,-,ENST00000327044,1.0,NOC2L,491
567,chr1,965191,965719,ENST00000338591.8_utr3_11_0_chr1_965192_f,0,+,ENST00000338591,1.0,KLHL17,528
568,chr1,974575,975865,ENST00000379410.8_utr3_15_0_chr1_974576_f,0,+,ENST00000379410,1.0,PLEKHN1,1290
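The sort matters because `np.searchsorted` assumes an ascending array. The sketch below uses made-up start coordinates (not taken from the BED file) to show what `side='right'` minus one returns for a query inside an interval, exactly on a start, and before every start.

```python
import numpy as np

# Hypothetical UTR start coordinates, already sorted in ascending order.
starts = np.array([100, 200, 400])

# side='right' returns the insertion point after any equal entries, so
# subtracting 1 gives the index of the last start that is <= the query.
idx_inside = np.searchsorted(starts, 250, side='right') - 1
idx_on_start = np.searchsorted(starts, 200, side='right') - 1

# A query smaller than every start yields -1, i.e. "no candidate".
idx_before_all = np.searchsorted(starts, 50, side='right') - 1
```

Note that `-1` is a valid positional index in pandas and NumPy (it selects the last element), so it is worth handling that case explicitly.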


In [9]:
# supplementary info to the paper
griesemer_bed = pd.read_csv('/s/project/mll/sergey/MLM/griesemer_data/Oligo_Variant_Info_GRCh38.bed',
                            sep='\t',
                            names=['chrom','var_start','var_end','ref','alt','mpra_variant_id','variant_id']).drop_duplicates()

# ref and alt alleles of equal length: this drops indels, but it would also keep
# equal-length multi-base substitutions if any are present
is_snp = griesemer_bed.ref.str.len()==griesemer_bed.alt.str.len()

griesemer_bed = griesemer_bed[is_snp] #take only SNP variants

griesemer_bed.head()

Unnamed: 0,chrom,var_start,var_end,ref,alt,mpra_variant_id,variant_id
2,chr1,965591,965592,T,G,r,rs9697711
4,chr1,965642,965643,T,C,r,rs13303351
8,chr1,975013,975014,C,T,r,rs28477686
10,chr1,975028,975029,T,C,r,rs28536514
12,chr1,975057,975058,A,G,r,rs6685581
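The equal-length test above separates indels from substitutions. A stricter single-base test could look like the following sketch; the toy alleles are invented for illustration and are not from the paper's supplementary data.

```python
import pandas as pd

# Toy variants (hypothetical values, not taken from the Griesemer data).
variants = pd.DataFrame({
    'ref': ['T', 'AC', 'G', 'CT'],
    'alt': ['G', 'GT', 'GA', 'C'],
})

# Equal allele lengths: removes the insertion (G>GA) and the deletion (CT>C).
same_length = variants.ref.str.len() == variants.alt.str.len()

# Both alleles exactly one base long: additionally removes AC>GT.
single_base = (variants.ref.str.len() == 1) & (variants.alt.str.len() == 1)

equal_length_variants = variants[same_length]
strict_snps = variants[single_base]
```

Whether the two filters differ on the real table depends on whether it contains multi-base substitutions, which is not checked here.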


In [10]:
# for each oligo variant, look for a human UTR region containing this variant

res = []

for chrom in griesemer_bed.chrom.unique():
    chrom_utr_df = human_utr_df[human_utr_df.chrom==chrom]
    for _, row in griesemer_bed[griesemer_bed.chrom==chrom].iterrows():
        var_start = row.var_start
        # index of the last UTR whose start is <= var_start (-1 if there is none)
        utr_idx = np.searchsorted(chrom_utr_df.human_UTR_start,var_start,'right')-1
        # all later UTRs start after var_start, so in practice only this candidate is tested;
        # a longer UTR that starts earlier and still spans the variant is not checked
        while 0<=utr_idx<len(chrom_utr_df) and var_start>=chrom_utr_df.iloc[utr_idx].human_UTR_start:
            if var_start<chrom_utr_df.iloc[utr_idx].human_UTR_end:
                row['UTR_ID'] = chrom_utr_df.iloc[utr_idx].UTR_ID
                res.append(row) # keep the first matching UTR only
                break
            utr_idx+=1
        #else:
        #    print(row.chrom, row.var_start)
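When UTRs overlap or nest, the last interval starting at or before the variant is not necessarily the one containing it. The sketch below illustrates this with two made-up intervals and a brute-force containment mask; `containing_intervals` is a hypothetical helper and not part of this notebook.

```python
import numpy as np

def containing_intervals(starts, ends, pos):
    """Return indices of all half-open intervals [start, end) that contain pos."""
    starts = np.asarray(starts)
    ends = np.asarray(ends)
    return np.flatnonzero((starts <= pos) & (pos < ends))

# Hypothetical nested UTRs sorted by start: the second lies inside the first.
starts = np.array([100, 200])
ends = np.array([1000, 300])
pos = 500

# Candidate from the searchsorted lookup: last interval with start <= pos.
candidate = np.searchsorted(starts, pos, side='right') - 1
candidate_contains = bool(starts[candidate] <= pos < ends[candidate])

# The mask checks every interval and finds the outer one.
hits = containing_intervals(starts, ends, pos)
```

The mask costs one pass over all intervals per query, so it trades speed for completeness. For large inputs an interval tree or a dedicated genomic-interval library may be a better fit.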

In [13]:
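# attach UTR coordinates and annotation; chrom and UTR_ID are the only shared columns
utr_variants = pd.DataFrame(res).merge(human_utr_df, on=['chrom','UTR_ID'])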
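Naming the join keys and asking pandas to validate the relationship can guard against accidental row duplication, for example if a `UTR_ID` appeared twice in the UTR table. A minimal sketch on toy frames, with all values invented:

```python
import pandas as pd

# Two hypothetical variants assigned to the same UTR.
variants = pd.DataFrame({
    'chrom': ['chr1', 'chr1'],
    'var_start': [150, 160],
    'UTR_ID': ['utr_a', 'utr_a'],
})

# Hypothetical UTR annotation with one row per (chrom, UTR_ID).
utrs = pd.DataFrame({
    'chrom': ['chr1', 'chr1'],
    'UTR_ID': ['utr_a', 'utr_b'],
    'strand': ['+', '-'],
})

# validate='many_to_one' raises pandas.errors.MergeError if the key
# columns are not unique in the right-hand frame.
merged = variants.merge(utrs, on=['chrom', 'UTR_ID'], validate='many_to_one')
```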

In [20]:
# get distance distribution between the variant and the stop codon

# .copy() so that the new column is added to independent frames (avoids SettingWithCopyWarning)
plus_genes = utr_variants[utr_variants.strand=='+'].copy()
minus_genes = utr_variants[utr_variants.strand=='-'].copy()

# the 3'UTR begins right after the stop codon: at human_UTR_start on the plus strand,
# at human_UTR_end on the minus strand
plus_genes['stop_codon_dist'] = plus_genes.var_start - plus_genes.human_UTR_start
# NOTE: with 0-based half-open BED coordinates, the first UTR base on the minus strand gives 1 here,
# whereas on the plus strand it gives 0
minus_genes['stop_codon_dist'] = minus_genes.human_UTR_end - minus_genes.var_start

stop_codon_dist = pd.concat([plus_genes,minus_genes]).stop_codon_dist

stop_codon_dist.describe()

count     8148.000000
mean      1664.254664
std       1955.698236
min          0.000000
25%        392.000000
50%       1018.500000
75%       2175.500000
max      21917.000000
Name: stop_codon_dist, dtype: float64
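The same distance can be computed in one vectorized step without splitting the frame by strand. The sketch below assumes 0-based half-open BED coordinates and uses `var_end` on the minus strand so that the first UTR base has distance 0 on both strands. That convention is an assumption: it differs by 1 on the minus strand from the calculation above, so the summary statistics would shift slightly. The toy rows are invented.

```python
import numpy as np
import pandas as pd

# One hypothetical variant per strand, each on the first transcribed UTR base.
toy = pd.DataFrame({
    'strand': ['+', '-'],
    'human_UTR_start': [100, 100],
    'human_UTR_end': [200, 200],
    'var_start': [100, 199],
    'var_end': [101, 200],
})

# Plus strand: count from the UTR start; minus strand: count back from the UTR end.
toy['stop_codon_dist'] = np.where(
    toy.strand == '+',
    toy.var_start - toy.human_UTR_start,
    toy.human_UTR_end - toy.var_end,
)
```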