# Human Genome Data Processing

This notebook creates the dataset needed to train a classification model on long promoter sequences from the human genome.

#### Human Promoter Classification Long
This dataset is constructed following the methods presented in [PromID: Human Promoter Prediction by Deep Learning](https://arxiv.org/pdf/1810.01414.pdf). (I could not find the exact dataset; otherwise I would have used it.) It is built by taking the transcription start sites (TSSs) listed in the [EPDnew Database](ftp://ccg.vital-it.ch/epdnew/human/006/), locating them in the [NCBI Homo sapiens reference genome](https://www.ncbi.nlm.nih.gov/genome/51), and taking the sequence from -500 to +500 around each TSS. This is a more difficult classification problem, but also more representative of how promoter classification would be used in a real setting.

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
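from fastai import *
from fastai.text import *
# Explicit imports for names used below (the fastai star imports also expose them)
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import numpy as np
import pandas as pd
from Bio.Seq import Seq
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import FeatureLocation, CompoundLocation
import networkx as nx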

In [3]:
sys.path.append("../../..")
from utils import *

In [4]:
path = Path('F:/genome/human/')

# Long Sequences Classification Data

This section extracts promoters around the TSSs in the EPDnew dataset. Each promoter spans -500 to +500 around its TSS. Negative examples are sampled at random from the regions between TSSs. As in the PromID paper, 10% of the data is held out for testing. Of the remaining 90%, 10% is used for validation, which gives roughly an 81/9/10 train/validation/test split.

In [5]:
fname = 'GCF_000001405.38_GRCh38.p12_genomic.gbff'

In [6]:
promoter_reference = pd.read_csv(path/'Hs_EPDnew_006_hg38.sga', sep='\t', 
                                header=None, names=['Ref', 'TSS', 'Location', 'Strand', 'V', 'Name'])
promoter_reference.drop('V', inplace=True, axis=1)
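
For readers without the file at hand, the layout that the `read_csv` call above assumes can be sketched with the standard library alone. The two rows reuse the positions and names shown in the `head()` output; the value in the fifth field is a placeholder I made up, since that column is dropped anyway.

```python
import csv
import io

# Two illustrative tab-separated rows in the six-column layout assumed above.
# The fifth field ("1") is a placeholder value, not taken from the real file.
sga_text = (
    "NC_000001.11\tTSS\t959256\t-\t1\tNOC2L_1\n"
    "NC_000001.11\tTSS\t960633\t+\t1\tKLHL17_1\n"
)

fields = ['Ref', 'TSS', 'Location', 'Strand', 'V', 'Name']
rows = []
for record in csv.DictReader(io.StringIO(sga_text), fieldnames=fields, delimiter='\t'):
    record.pop('V')                                # mirror the drop('V') step
    record['Location'] = int(record['Location'])   # positions are integers
    rows.append(record)

print(rows[0])
```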

In [7]:
promoter_reference.head()

Unnamed: 0,Ref,TSS,Location,Strand,Name
0,NC_000001.11,TSS,959256,-,NOC2L_1
1,NC_000001.11,TSS,960633,+,KLHL17_1
2,NC_000001.11,TSS,966482,+,PLEKHN1_1
3,NC_000001.11,TSS,976681,-,PERM1_1
4,NC_000001.11,TSS,1000097,-,HES4_1


In [9]:
chroms = [GB for GB in SeqIO.parse(path/fname, "genbank") if GB.id in promoter_reference.Ref.unique()]

In [10]:
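def extract_promoter(loc, orient, GB_file):
    # Take the 1000 bp window from -500 to +500 around the TSS
    start = loc - 500
    end = loc + 500
    promoter = GB_file[start:end]
    
    # Minus-strand promoters are read from the opposite strand
    if orient == '-':
        promoter = promoter.reverse_complement()
        
    promoter = str(promoter.seq)
    
    # Discard windows that contain unresolved bases
    if 'N' not in promoter:
        return promoter
    else:
        return None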
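
The windowing and strand handling can be checked on a toy sequence without Biopython. This is only a sketch: `toy_extract`, `reverse_complement` and the 3 bp flank are hypothetical stand-ins for the function above and its 500 bp flank.

```python
# Complement table covering upper- and lower-case bases
# (the genome sequences in this notebook mix both cases).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def toy_extract(genome, loc, strand, flank=3):
    """Return genome[loc - flank:loc + flank], read along `strand`, or None if it contains N."""
    window = genome[loc - flank:loc + flank]
    if strand == '-':
        window = reverse_complement(window)
    return None if 'N' in window else window

genome = "AACCGGTTAACC"
plus = toy_extract(genome, 6, '+')    # "CGGTTA"
minus = toy_extract(genome, 6, '-')   # "TAACCG"
masked = toy_extract("AANNGGTTAACC", 6, '+')  # None, window contains N
```

Python slices are 0-based and exclude the end index, so the window covers indices `loc - flank` through `loc + flank - 1`. If the `Location` column is 1-based (an assumption worth checking against the EPDnew documentation), the TSS base itself sits at index `loc - 1`.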

In [11]:
def chromosome_to_promoter(GB, df):
    ref = GB.id
    data = df[df.Ref == ref].copy()
    if len(data) > 0:
        data['Sequence'] = data.apply(lambda x: extract_promoter(x['Location'], x['Strand'], GB), axis=1)
        return data

In [12]:
def chromosome_to_negative(GB, df):
    ref = GB.id
    data = df[df.Ref == ref].copy()
    if len(data) > 0:
        # Skip the first and last TSS: get_negative needs a neighbour on both sides
        output = [get_negative(data, i, GB) for i in range(1, len(data)-1)]
        output = [i for i in output if i is not None]
        if len(output) > 0:
            output = np.concatenate(output)
        return output

In [13]:
def get_negative(inp_df, i, GB):
    seqs = []
    tss = inp_df.Location.iloc[i]
    prev_tss = inp_df.Location.iloc[i-1]
    next_tss = inp_df.Location.iloc[i+1]
    
    # Gap between the previous promoter window and this one
    lowlow = prev_tss + 500
    lowhigh = tss - 500
    
    # Gap between this promoter window and the next one
    highlow = tss + 500
    highhigh = next_tss - 500
    
    range1 = lowhigh - lowlow
    range2 = highhigh - highlow
    
    # Sample a 1000 bp negative only if the gap can hold one
    if range1 > 1002:
        start = np.random.randint(lowlow, lowhigh-1000)
        rand_gene = str(GB[start:start+1000].seq)
        if 'N' not in rand_gene:
            seqs.append(rand_gene)
            
    if range2 > 1002:
        start = np.random.randint(highlow, highhigh-1000)
        rand_gene = str(GB[start:start+1000].seq)
        if 'N' not in rand_gene:
            seqs.append(rand_gene)
            
    if len(seqs) > 0:
        return np.array(seqs)
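
The interval arithmetic for a single gap can be isolated as below. This is a minimal sketch using the standard `random` module instead of NumPy; `sample_gap_start` is a hypothetical helper, not part of the notebook.

```python
import random

def sample_gap_start(prev_tss, next_tss, flank=500, length=1000, rng=random):
    """Pick a start so that [start, start + length) avoids both flanked TSS windows."""
    low = prev_tss + flank     # first position clear of the previous promoter window
    high = next_tss - flank    # first position inside the next promoter window
    if high - low <= length + 2:   # same guard as `range1 > 1002` above
        return None
    # randrange excludes the upper bound, like np.random.randint
    return rng.randrange(low, high - length)

rng = random.Random(0)
start = sample_gap_start(10_000, 20_000, rng=rng)
too_small = sample_gap_start(10_000, 11_500, rng=rng)   # None: gap is only 500 bp
```

One thing to keep in mind about `get_negative` as written: the gap after TSS `i` is the same region as the gap before TSS `i+1`, so each interior gap is sampled on two separate calls and the two negatives drawn from it may overlap.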

# Promoters

In [14]:
with ThreadPoolExecutor(8) as ex:
    outs = ex.map(lambda x: chromosome_to_promoter(x, promoter_reference), chroms)

In [15]:
dfs = list(outs)
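
A small note on the pattern used here: `ex.map` returns a lazy iterator, which is why the results are only materialised by `list(outs)`. The toy example below (the `square` function is just a stand-in for `chromosome_to_promoter`) shows that results come back in input order.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

with ThreadPoolExecutor(max_workers=4) as ex:
    lazy = ex.map(square, [3, 1, 2])   # an iterator, not a list

# Leaving the `with` block waits for all tasks, so this does not block.
results = list(lazy)                   # ordered like the inputs: [9, 1, 4]
```

Any exception raised inside a worker is re-raised when the iterator is consumed, so a failure in one chromosome would surface at the `list(...)` call rather than at `ex.map`.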

In [17]:
sequences_df = pd.concat(dfs)

In [18]:
sequences_df.head()

Unnamed: 0,Ref,TSS,Location,Strand,Name,Sequence
0,NC_000001.11,TSS,959256,-,NOC2L_1,GCTGGCCCGGTCTCCGCGGATCGGAGGCGAAGCCAGCCTGGCCCTC...
1,NC_000001.11,TSS,960633,+,KLHL17_1,GAGGAGGAAGAGGGCGAGGCTTAGGGGGGCtccttggaggaggagg...
2,NC_000001.11,TSS,966482,+,PLEKHN1_1,CCTTGCCCCCGAGTGCGCTGACTGTCTTGGCCGTCTAGGGGGCATG...
3,NC_000001.11,TSS,976681,-,PERM1_1,GGGAGGCGGTTCCCGGGGTTGGTGGGGGGAGCGGGAGGCGGTTCCC...
4,NC_000001.11,TSS,1000097,-,HES4_1,GGACCGGAGTGGGGACGGGCGGAGGAAGCCAAGAGGCTCGAGACCG...


In [19]:
sequences_df.shape

(29598, 6)

# Negatives

In [20]:
with ThreadPoolExecutor(8) as ex:
    outs = ex.map(lambda x: chromosome_to_negative(x, promoter_reference), chroms)

In [21]:
negatives = list(outs)

In [22]:
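# Drop chromosomes that returned nothing
negs = [i for i in negatives if i is not None]
# Keep only chromosomes that yielded more than one negative (this also removes empty lists)
negs = [i for i in negs if len(i) > 1]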

In [23]:
negs = np.concatenate(negs)

In [25]:
neg_df = pd.DataFrame(negs, columns=['Sequence'])

In [26]:
neg_df.head()

Unnamed: 0,Sequence
0,AGGGTGCCCTGTACGTGGCAGGGGGCAACGACGGCACCAGCTGCCT...
1,CTGACCTGCCCCTCCGCCCCTCCATTCAGGGGCCTCTCCAGGAGCC...
2,AGCCAGGGTGCCCCGAGGAGGAGGGTGGGTGGGTCCTTGTGTGGCC...
3,CGGGgaccccacccccctccccaccctgatCCTCGCAGCCGGCTCT...
4,GATGACTTTCACCTACTATTCAGCAGAAAACCAAAAGCCAAGATAA...


In [27]:
neg_df['Target'] = 0

# Concat

In [28]:
seq_data = sequences_df.Sequence

In [29]:
seq_df = pd.DataFrame(seq_data)

In [30]:
seq_df['Target'] = 1

In [31]:
classification_df = pd.concat([seq_df, neg_df])

In [32]:
classification_df.head()

Unnamed: 0,Sequence,Target
0,GCTGGCCCGGTCTCCGCGGATCGGAGGCGAAGCCAGCCTGGCCCTC...,1
1,GAGGAGGAAGAGGGCGAGGCTTAGGGGGGCtccttggaggaggagg...,1
2,CCTTGCCCCCGAGTGCGCTGACTGTCTTGGCCGTCTAGGGGGCATG...,1
3,GGGAGGCGGTTCCCGGGGTTGGTGGGGGGAGCGGGAGGCGGTTCCC...,1
4,GGACCGGAGTGGGGACGGGCGGAGGAAGCCAAGAGGCTCGAGACCG...,1


In [33]:
classification_df.shape

(70169, 2)

In [36]:
classification_df.reset_index(inplace=True, drop=True)

In [38]:
len(classification_df.Sequence[0])

1000

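Some errors slipped through: one promoter has a missing sequence (`extract_promoter` returns `None` when the window contains `N`), and one negative does not consist of exactly the bases A, T, G and C. Both rows are dropped.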

In [6]:
classification_df[classification_df.Sequence.map(lambda x: isinstance(x, float))]

Unnamed: 0,Sequence,Target
14533,,1


In [8]:
classification_df.drop(14533, inplace=True)

In [18]:
classification_df[~classification_df.Sequence.map(lambda x: set(x.upper()) == set('ATGC'))]

Unnamed: 0,Sequence,Target
50811,gagttgaagccctaaccctcaataaacctgtatttggagatagagc...,0


In [19]:
classification_df.drop(50811, inplace=True)
classification_df.reset_index(inplace=True, drop=True)
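
The alphabet check above uses set equality, so it would also flag a sequence that is valid but happens to lack one of the four bases. A subset test is a possible alternative; the sketch below compares the two on made-up strings (both helper names are hypothetical).

```python
VALID_BASES = set("ACGT")

def has_all_four_bases(seq):
    """The check used above: the set of characters must equal {A, C, G, T}."""
    return set(seq.upper()) == VALID_BASES

def only_valid_bases(seq):
    """Alternative: a non-empty string containing nothing outside {A, C, G, T}."""
    return isinstance(seq, str) and len(seq) > 0 and set(seq.upper()) <= VALID_BASES

examples = ["acgtACGT", "AAGGTT", "ACGTR"]
equality_flags = [has_all_four_bases(s) for s in examples]  # [True, False, False]
subset_flags = [only_valid_bases(s) for s in examples]      # [True, True, False]
```

For 1000 bp genomic windows the two checks will almost always agree; the subset version also handles missing values without a separate float check.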

In [8]:
def partition_data(df):
    # 10% test; the remaining 90% is split 90/10 into train and validation
    train_size = int(len(df)*0.9*0.9)
    valid_size = int(len(df)*0.9) - train_size
    
    # Sample without replacement; each later set is drawn from the rows left over
    train_df = df.sample(train_size)
    test_val = df.drop(train_df.index)
    valid_df = test_val.sample(valid_size)
    test_df = test_val.drop(valid_df.index)
    train_df['set'] = 'train'
    valid_df['set'] = 'valid'
    test_df['set'] = 'test'
    
    return (train_df, valid_df, test_df)

In [9]:
pos_df = classification_df[classification_df.Target == 1]
neg_df = classification_df[classification_df.Target == 0]

In [10]:
t1, v1, test1 = partition_data(pos_df)
t2, v2, test2 = partition_data(neg_df)

In [11]:
data_df = pd.concat([t1, t2, v1, v2, test1, test2])

In [12]:
data_df.head()

Unnamed: 0,Sequence,Target,set
5122,ccaGTTGAAAAGTAGAGGCCGAGGACAGAGTTAGACACTCGTTGTC...,1,train
11757,ggaagggcgCAAGAGAGGATCAGGGGTCAGCGGCACACCCATGGAG...,1,train
5822,TAAAGAAATACAAGGATTCCTCAAGCCCCTCTTCCCTAAAACATGC...,1,train
20025,CGCGGGGCCGGGGAAGCCCGCGCGCGTCATCAGCAGCGGCGCCGCG...,1,train
14727,TACACAGTAAGGACAGCCGCTGGAGCGCTACGGTCTGACGAACGAG...,1,train


In [14]:
data_df[data_df.set == 'train'].shape, data_df[data_df.set == 'valid'].shape, data_df[data_df.set == 'test'].shape, data_df.shape

((56834, 3), (6316, 3), (7017, 3), (70167, 3))
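
As a sanity check, the split sizes above can be reproduced from the class counts alone. The counts are derived from the shapes printed earlier (29598 promoters and 70169 rows in total, with one row dropped from each class); `split_sizes` is a hypothetical helper that repeats the arithmetic in `partition_data`.

```python
def split_sizes(n):
    """Train/valid/test sizes as computed in partition_data."""
    train = int(n * 0.9 * 0.9)
    valid = int(n * 0.9) - train
    test = n - train - valid
    return train, valid, test

n_pos = 29598 - 1            # promoters, minus the row with a missing sequence
n_neg = 70169 - 29598 - 1    # negatives, minus the row that failed the base check

pos = split_sizes(n_pos)     # (23973, 2664, 2960)
neg = split_sizes(n_neg)     # (32861, 3652, 4057)
totals = tuple(p + q for p, q in zip(pos, neg))
print(totals)                # (56834, 6316, 7017)
```

Because positives and negatives are partitioned separately, each set keeps roughly the same class ratio as the full dataset.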

In [15]:
data_df.to_csv(path/'human_promoters_long.csv', index=False)