Through these experiments, we want to show two issues in studying interaction between drugs and domains.
1. **First problem:** That when a drug interacts with a single-domain protein (with domain X), even if we correctly conclude that it interacts with domain X, It may not interact with another single domain protein that has domain X. This is easy to check using Data. For this, we need some negative interaction data and for that, we can go to affinity data.  
2. **Second Problem:** is about multi-domain proteins and that is when a drug is interacting with a multi-domain protein (with domains X and Y), we can’t confidently say if this drug interacts with X or Y or both or either or neither meaning several cases are possible:  
    - The drug interacts with protein because it interacts directly with X
    - The drug interacts with protein because it interacts directly with Y
    - The drug interacts with protein because X and Y are both present
    - The drug interacts with protein because either of X or Y are present
    - The drug interacts because X and Y are present and they are in certain configuration with respect to each other or other extrinsic properties of the protein besides existence of X and Y.
    - The drug interacts for a completely irrelevant reason to existence of X or Y. 
    
There might be some overlap between the problem-1 and problem-2. But conceptually, we can say that first problem arises when trying to go from a drug-domain interaction to drug-protein interaction and the second problem arises when we go in the reverse direction. We want to see if we can quantitatively assess how prevalent these problems are or at least illuminate them as much as possible.

# from positive interactions to negative
This means we infer drug-domain interactions from drug interactions of single domain proteins, and then find examples where the same domain occurs in other proteins but doesn't interact with same drugs (we have a negative interaction for it in our dataset). for this, the negative interactions are very important. common drug-target interaction databases only have positive interactions and they assume lack of a pair in the dataset to mean lack of interaction, which is obviously not correct. However there are some researches that also collect negative interaction data like [Coelho2016](https://doi.org/10.1371/journal.pcbi.1005219) where they have used affinity data to extract some negative interactions

# First attempt using Coelho2016 datset
This dataset is based on [Coelho2016](https://doi.org/10.1371/journal.pcbi.1005219) paper and contains negative and positive interactions. Negative interactions are extracted from BindingDB and BioLip databases, even though BioLip is questionable as a source of negative interactions because it is extracted from strucutres of drug-target complexes in the PDB, while we are more interested in those based on chemical assays.
To use this dataset to search for cases of problem-1, we create a table where for each pair of drug D and proteins P, where the protein is single-domain M, we list all other proteins Q that have the same domain M and divide them into three groups:
1. **Pos:** those that there is a positive interaction in the dataset between Q and D
1. **Neg:** those that there is a negative interaction in the dataset between Q and D
1. **Unk:** those that there is no interaction information in the dataset between Q and D

In [None]:

#first we read the dataset and 

interacts = dict() # for each pair in the dataset (key),  it shows the annotations True/False (interaction/non-interaction) if it exists in the dataset. we basically store all dataset infromation here.
uniprot_ids = set() # set of uniprot IDs for the purpose of collecting their pfam domain annotations
drugsof= dict() # we want positive interactions for single domain proteins so we store them here to be readily available

import pandas
for f in ["drugbank_DTIs_REAL_NEGS.txt","test_data_sc_and_bc.txt","yamanishi_DTIs_REAL_NEGS.txt"]:
    df = pandas.read_csv("DTIPred/"+f, sep = "\t", header = None)
    for index , row in df.iterrows():
        pid  = row[0]
        did  = row[1]
        interaction_exist  = row[2]
        uniprot_ids.add(pid)
#         if (pid,did) in interacts:
#             if (interacts[(pid,did)] != interaction_exist):
#                 print ("error repeat", (pid,did))
#         else:
        interacts[(pid,did)] = interaction_exist 
        if interaction_exist == 1:               
            if pid in drugsof:
                drugsof[pid].append(did)
            else:
                drugsof[pid] = [did]
        
with open ("DTIPred/uniprotids.txt", "w") as pf:
    pf.writelines("\n".join(uniprot_ids))
    
        
        

In [None]:

# here, we read the domain annotations we have downloaded from uniprot.

import pandas

def extract_items(field):
    if ";" not in field:
        return []
    else:
        spl = field.split(";")
        for s in spl:
            if len(s) <=1:
#                 print(s)
                spl.remove(s)
        return spl

proteinswith = dict() #for each domain (pfam ID), this will store the set of proteins (uniprot IDs) that have this domain
domainsof  = dict () #for each protein (uniprot ID), this will store the list of domains (pfam IDs) of that protein
df = pandas.read_csv("DTIPred/uniprotids_annnots.tab", sep = "\t", converters={i: str for i in range(100)})

for index , row in df.iterrows():
    domain_field = row ["Cross-reference (Pfam)"]
    pid = row["yourlist:M20210201A94466D2655679D1FD8953E075198DA83D46A3C"]    
    if True: #conditions for considering a protien such as being human protein or being reviewed
            domain_list = extract_items(domain_field)
            domainsof[pid]= domain_list 
            for dom  in domain_list:
                if dom in proteinswith:
                    proteinswith[dom].append(pid)
                else:
                    proteinswith[dom]= [pid]
        
    
num_domains = {x:len(domainsof[x]) for x in domainsof.keys()}
one_domain = [x for x in domainsof.keys() if len(domainsof[x])==1]

In [None]:

# here we do the calculations, meaning we prepare the table consisting of pairs of single domain proteins (P) and interacting drugs (D) the number of proteins falling to each of the three groups and the ID of these proteins are stored in the next columns

drug_level_examples = "onedomain-protein,domain,interacting_drug,num_pos,num_neg,num_unk,pos,neg,unk\n"
protein_level_exmples  = ""


for p in one_domain:
    m = domainsof[p][0]
    Q_set = proteinswith[m].copy()
    if p in Q_set:
        Q_set.remove(p)        
    if p in drugsof:
        D_set = drugsof[p]
        for d in D_set:
            negs = []
            pos = []
            unk= []
            for q in Q_set:
                if (q,d) in interacts:
                    if interacts[(q,d)]==1:
                        pos.append(q)
                    else:
                        negs.append(q)
                else:
                    unk.append(q)
            row_str= ",".join ([p,m,d,str(len(pos)),str(len(negs)),str(len(unk)),";".join(pos), ";".join(negs), ";".join(unk)])+"\n"
            drug_level_examples+= row_str
            
with open("result_drug_level.csv","w") as outf:
    outf.writelines(drug_level_examples)
    

The result of this experiments showed that we couldn't find occurance of the problem-1 with this dataset. This can be due to small number of negative interactions that we have which can be due the the dataset being old. Therefore, we recollect the negative interactions from BindingDB to do this experiment again.

# Second attempt using BindingDB
we downloaded the BindingDB in tsv format. There were few issues here. First of all, for affinity, there are several measures here including Ki, Kd, IC50, and EC50. The literature that use affinity to obtain negative interactions don't clarfiy which of these measures they have used except one preprint that says they use Ki or IC50, even though based on a search that I did Kd is the most relevant measure for durg binding to proteins. 
Another problem is that some of the rows (interactions) in the bindingDB don't have a uniprot ID or have multiple chains. these cases altogether constitute less than 13% of interactions in the dataset. So for now, we ignore them because it makes the life much easier. 

In [1]:
### read the dataset and store binary interactions or read these information from the pickled file


reload_dataset= False
interaction_threshold = 1000
noninteraction_threshold = 30000
blank_length_threshold = 30
bindingdb_path = "data/BindingDB_All_2021m0.tsv/BindingDB_All.tsv"
domain_data_dir = "temp/domain_data/"
pickled_data_path = "temp/pickled_data/data.pickle"


import os
import pandas 
import numpy
import pickle
import xmltodict
from os import path
import xml.etree.ElementTree as ET
import requests



def binarize_field(x,l_threshold,h_threshold):
    try:
        if type(x) != str:
            return "nonstring"
        else:
            x_c = x.replace(">","")
            x_c = x.replace("<","")
            x_C = x.replace(" ","")
            val  = float(x_c)
            if val == 0:
                return "zero"
            if val>h_threshold:
                val_bin = False
            elif val<l_threshold:
                val_bin = True
            else:
                return "middle"
            if val_bin and (">" in x):
                return "invalid"
            if (not val_bin) and ("<" in x):
                return "invalid"
            return val_bin
    except:
        return "error"

    
    
def binarize(rows):
    num_pos = 0
    num_neg = 0
    try:
        Ki_col= rows["Ki (nM)"].values 
        IC50_col= rows["IC50 (nM)"].values
    except:
        Ki_col= [rows["Ki (nM)"]]
        IC50_col= [["IC50 (nM)"]]
    for i in range(len(Ki_col)):
        bin_Ki = binarize_field(Ki_col[i],interaction_threshold, noninteraction_threshold)
        bin_IC50 = binarize_field(Ki_col[i],interaction_threshold, noninteraction_threshold)
        if type (bin_Ki)== bool:
            if bin_Ki:
                num_pos +=1
            else:
                num_neg +=1
        if type (bin_IC50)== bool:
            if bin_IC50:
                num_pos +=1
            else:
                num_neg +=1
    num_all = num_pos + num_neg
    if num_all == 0:
        return "undecided_none"
    if num_neg == 0:
        return True
    if num_pos == 0:
        return False
    pos_fraction = float(num_pos)/num_all
    neg_fraction = float(num_neg)/num_all
    if pos_fraction>0.5:
        return True
    if neg_fraction>0.5:
        return False
    return "undceided_conflict"






def binarize_strict(rows):
    num_pos = 0
    num_neg = 0
    try:
        Ki_col= rows["Ki (nM)"].values 
        IC50_col= rows["IC50 (nM)"].values
    except:
        Ki_col= [rows["Ki (nM)"]]
        IC50_col= [["IC50 (nM)"]]
    for i in range(len(Ki_col)):
        bin_Ki = binarize_field(Ki_col[i],interaction_threshold, noninteraction_threshold)
        bin_IC50 = binarize_field(Ki_col[i],interaction_threshold, noninteraction_threshold)
        if (type (bin_Ki)== bool) and (type (bin_IC50)== bool):
            if bin_Ki and bin_IC50:
                num_pos+=1
            elif (not bin_Ki) and (not bin_IC50):
                num_neg+=1            
        elif type (bin_Ki)== bool:
            if bin_Ki:
                num_pos +=1
            else:
                num_neg +=1
        elif type (bin_IC50)== bool:
            if bin_IC50:
                num_pos +=1
            else:
                num_neg +=1
    num_all = num_pos + num_neg
    if num_all == 0:
        return "undecided_none"
    if num_neg == 0:
        return True
    if num_pos == 0:
        return False
#     pos_fraction = float(num_pos)/num_all
#     neg_fraction = float(num_neg)/num_all
#     if pos_fraction>0.5:
#         return True
#     if neg_fraction>0.5:
#         return False
    return "undceided_conflict"

class StringConverter(dict):
    def __contains__(self, item):
        return True
    def __getitem__(self, item):
        return str
    def get(self, default=None):
        return str

def extract_items(field):
    if ";" not in field:
        return []
    else:
        spl = field.split(";")
        for s in spl:
            if len(s) <=1:
    #                 print(s)
                spl.remove(s)
        return spl


def pfam_record_dict(p):
    p_path = domain_data_dir + p + ".xml"
    if not path.exists(p_path):
        url = "http://pfam.xfam.org/protein/"+p+"?output=xml"
        req = requests.get(url)
        with open (domain_data_dir+prot+".xml","w") as outf:
            outf.writelines(req.text)
            
    with open(p_path) as pf:
        p_dict = xmltodict.parse(pf.read())
    return p_dict

def get_domains(p):
    p_dict = pfam_record_dict(p)
    p_domains = p_dict["pfam"]["entry"]["matches"]["match"]
    domain_list =[]
    if type(p_domains) != list:
        p_domains =[p_domains]
    for dom in p_domains:
        acc = dom["@accession"]
        domain_list.append(acc)
    return domain_list

# def specie_name(p):
#     p_dict  = pfam_record_dict(p)
#     return p_dict["pfam"]["entry"]["taxonomy"]["@species_name"]


def check_single_domain(p):
    p_dict = pfam_record_dict(p)
    p_seq  = p_dict["pfam"]["entry"]["sequence"]["#text"]
    p_len  = len (p_seq)
    on_domain =[False]*p_len
    p_domains = p_dict["pfam"]["entry"]["matches"]["match"]
    if type(p_domains) != list:
        p_domains =[p_domains]
    for dom in p_domains:
        acc = dom["@accession"]
        # type = dom["@type"]
        begin = int(dom["location"]["@start"])-1
        end = int(dom["location"]["@end"])-1
        on_domain[begin:end]  = [True] * ((end-begin)+1)
    streak = 0 
    max_streak = 0
    for i in range(p_len):
        if on_domain[i]:
            streak = 0            
        else:
            streak += 1
            max_streak  = max(streak, max_streak)
            
    if (max_streak > blank_length_threshold) or (len(p_domains)>1):
        return False
    else:
        return True
    
import sys
import io
binfileexist = os.path.exists(pickled_data_path)

if (not binfileexist) or reload_dataset:
#     col_names = pandas.read_csv("data/BindingDB_All_2021m0.tsv/BindingDB_All.tsv", sep = "\t", nrows=0).columns
    print ("Reading Binding DB file...",end = "")

    save_stderr = sys.stderr
    sys.stderr = open(os.devnull, 'w')
    df = pandas.read_csv(bindingdb_path, sep = "\t",error_bad_lines=False,converters=StringConverter())
    sys.stderr = save_stderr
    
    single_chain_mask = df["Number of Protein Chains in Target (>1 implies a multichain complex)"]=="1"
    hasswissprot_mask = numpy.logical_not(df["UniProt (SwissProt) Primary ID of Target Chain"].isna())
    haspubchemcid_mask = numpy.logical_not(df["PubChem CID"].isna())
    singlechain_idcomplete_mask  = single_chain_mask & hasswissprot_mask & haspubchemcid_mask
    easy_df = df.loc[singlechain_idcomplete_mask,:]
    
    
    
    interaction_idx = dict()
    idx = 0 
    for index, row in easy_df.iterrows():
        idx+=1
        print ("\r                                                     \rGrouping rows by drug-protein pairs "+str(int(idx*100/len(easy_df)))+"%",end = "")
        pid = row["UniProt (SwissProt) Primary ID of Target Chain"]
        did = row["PubChem CID"]
        if (pid != "") and (did != ""):
            if (pid,did) in interaction_idx:
                interaction_idx[(pid,did)].append(index)
            else:
                interaction_idx[(pid,did)] = [index]
            
    interact = dict()
    drugsof = dict()
    idx = 0
    included_proteins = set()
    for (p,d) in interaction_idx:
        idx+=1
        print ("\r                                                                          \rBinarizing "+str(int(idx*100/len(interaction_idx)))+"%",end = "")
        res = binarize_strict(easy_df.loc[interaction_idx[(p,d)],:])
        if type(res)== bool:
            interact[(p,d)] = res
            included_proteins.add(p)
            if res:
                if p in drugsof:
                    drugsof[p].add(d)
                else:
                    drugsof[p] = {d}

    
    issingledom = dict()
    proteinswith = dict() #for each domain (pfam ID), this will store the set of proteins (uniprot IDs) that have this domain
    domainsof  = dict () 
    included_proteins_with_domain = []
    for p in included_proteins:
        try:
            domain_list=get_domains(p)
            domainsof[p]= domain_list
            for dom  in domain_list:
                if dom in proteinswith:
                    proteinswith[dom].add(p)
                else:
                    proteinswith[dom]= set([p])
            issingledom[p] = check_single_domain(p)
            included_proteins_with_domain.append(p)
        except:
            a = 7
    single_domains =  [x for x in included_proteins_with_domain if issingledom[x]]
    with open(pickled_data_path, 'wb') as f:
        pickle.dump([interaction_idx, interact,drugsof,included_proteins_with_domain,proteinswith,domainsof,issingledom,single_domains], f)

else:# when the file exists
    with open(pickled_data_path, 'rb') as f:
        [interaction_idx, interact,drugsof,included_proteins_with_domain,proteinswith,domainsof,issingledom,single_domains] = pickle.load(f)



Now we find all the triples that can serve as examples for problem one

In [3]:
#Here we find the examples of problem 1. 

csv_report_str = "onedomain-protein,domain,interacting_drug,num_pos,num_neg,num_unk,pos,neg,unk\n"
# protein_level_exmples  = ""


problem1_pdq_triples =set()
problem1_drug_domain_pairs = set()
dataset1_drug_positive_set = set()
dataset1_drug_negative_set = set()
dataset1_protein_positive_set= set()
dataset1_protein_negative_set =set()
dataset1_drug_prtein_pairs = set()
problem1_domain_set = set()
problem1_drug_set = set()
domain_set = set()

for p in single_domains:
    m = domainsof[p][0]
    Q_set = proteinswith[m].copy()
    if p in Q_set:
        Q_set.remove(p)        
    if p in drugsof:
        D_set = drugsof[p]
        for d in D_set:
            negs = []
            pos = []
            unk= []
            for q in Q_set:
                if type(q)== float:
                    print("error-q:", q)
                if (q,d) in interact:
                    dataset1_drug_prtein_pairs.add((p,d))
                    dataset1_drug_prtein_pairs.add((q,d))
                    dataset1_protein_positive_set.add(p)
                    if interact[(q,d)]:
                        pos.append(q)
                        dataset1_protein_positive_set.add(q)
                        dataset1_drug_positive_set.add(d)
                    else:
                        negs.append(q)
                        problem1_drug_domain_pairs.add((m,d))
                        problem1_pdq_triples.add((p,d,q))
                        problem1_domain_set.add(m)
                        problem1_drug_set.add(d)
                        dataset1_protein_negative_set.add(q)
                        dataset1_drug_negative_set.add(d)
                else:
                    unk.append(q)
            if (type(p) == float):
                print("error-p:", p)
            if (type(m) == float):
                print("error-m:", m)
            if (type(d) == float):
                print("error-d:", d)                
            row_str= ",".join ([p,m,d,str(len(pos)),str(len(negs)),str(len(unk)),";".join(pos), ";".join(negs), ";".join(unk)])+"\n"
            csv_report_str += row_str
            
with open("outputs/problem1_10_30.csv","w") as outf:
    outf.writelines(csv_report_str)

Q_pos = dict()
Q_neg = dict()
for dom,dd in problem1_drug_domain_pairs:
    for prot in  proteinswith[dom]:
        if (prot, dd) in interact:
            if interact[(prot,dd)]:
                if (dom,dd) in Q_pos:
                    Q_pos[(dom,dd)].add(prot)
                else:
                    Q_pos[(dom,dd)] = set([prot])
            else:
                if (dom,dd) in Q_neg:
                    Q_neg[(dom,dd)].add(prot)
                else:
                    Q_neg[(dom,dd)] = set([prot])

## Removing Conflicting Cases: rc1
There are unreasonable cases in our dataset where a single domain protein (with our definition) interacts with a drug but another single domain protein with same domain doesn't interact. This can happen for different reasons.
1. There is an error in the dataset in one of the two cases.
2. These measurements are done in different conditions on the same (or very similar) proteins for example the pH or temperature are different. 
2. these two proteins though sharing same domain have some difference in the sequenc that causes them to show very different affinity to the same drug.

In this experiment, we want to remove such cases to see how many drug-domain pairs involving how many domains) will remain in the dataset. This will be stored in `problem1_drug_domain_pairs_rc` rc stands for removed conflicts


In [17]:
problem1_drug_domain_pairs_rc1 = set()
for dom,dd in problem1_drug_domain_pairs:
    valid_entry = True
    for q in Q_neg[(dom,dd)]:
        if check_single_domain(q):
            valid_entry = False
    if valid_entry:
        problem1_drug_domain_pairs_rc1.add((dom,dd))


In [10]:
len(pdq_triples)

785

In [18]:
len(problem1_drug_domain_pairs_rc1)

100

In [39]:
set([dom for (dom,drug) in problem1_drug_domain_pairs_rc1])

{'PF00001',
 'PF00067',
 'PF00089',
 'PF00135',
 'PF00186',
 'PF00194',
 'PF00303',
 'PF00962',
 'PF02931'}

## Removing Conflicting Cases: rc2
In the second way of removing coflict which is more strict, we assume that proteins that have same arrangement of domain/non-domain segments are equal. Meaning, we find the doman segments and also non-domain segements of length more than 30 residues and encode the protein based on arrangement of these segments.

In [36]:
def get_segment_encoding(prot):
    p_dict = pfam_record_dict(prot)
    p_seq  = p_dict["pfam"]["entry"]["sequence"]["#text"]
    p_len  = len (p_seq)
    on_domain =[False]*p_len
    p_domains = p_dict["pfam"]["entry"]["matches"]["match"]
    if type(p_domains) != list:
        p_domains =[p_domains]
    last_end = 0
    segments = []
    for dom in p_domains:
        acc = dom["@accession"]
        begin = int(dom["location"]["@start"])
        end = int(dom["location"]["@end"])
        if begin-1-last_end > blank_length_threshold:
            segments+= ["B",acc]  
        else:
            segments+= [acc]
        last_end = end
    if p_len-end > blank_length_threshold:
        segments+= ["B"]
    return "_".join(segments)

problem1_drug_domain_pairs_rc2 = set()

for dom,dd in problem1_drug_domain_pairs:
    valid_entry = True
    for p in Q_pos[(dom,dd)]:
        for q in Q_neg[(dom,dd)]:
            if get_segment_encoding(p)== get_segment_encoding(q):
                valid_entry = False
    if valid_entry:
        problem1_drug_domain_pairs_rc2.add((dom,dd))


In [None]:
def get_segments(prot):
    p_dict = pfam_record_dict(prot)
    p_seq  = p_dict["pfam"]["entry"]["sequence"]["#text"]
    p_len  = len (p_seq)
    on_domain =[False]*p_len
    p_domains = p_dict["pfam"]["entry"]["matches"]["match"]
    if type(p_domains) != list:
        p_domains =[p_domains]
    last_end = 0
    segments = []
    for dom in p_domains:
        acc = dom["@accession"]
        begin = int(dom["location"]["@start"])
        end = int(dom["location"]["@end"])
        if begin-1-last_end > blank_length_threshold:
            segments+= ["B["+str(last_end+1)+"-"+str(begin-1)+"]"]
            segments+= [acc+"["+str(begin)+"-"+str(end)+"]"]            
        else:
            segments+= [acc+"["+str(begin)+"-"+str(end)+"]"]
        last_end = end
    if p_len-end > blank_length_threshold:
        segments+= ["B["+str(end+1)+"-"+str(p_len)+"]"]
    return "_".join(segments)

In [34]:
get_segment_encoding("P20813")

'PF00067[31-488]'

In [37]:
len(problem1_drug_domain_pairs_rc2)

96

In [38]:
set([dom for (dom,drug) in problem1_drug_domain_pairs_rc2])

{'PF00001',
 'PF00067',
 'PF00089',
 'PF00135',
 'PF00186',
 'PF00194',
 'PF00303',
 'PF00962',
 'PF02931'}

## Createing report of found examples

since we learned that even with imposing these restirctions, we have close to one thoasand examples left, we generate the report based on these examples


In [12]:
import matplotlib.pyplot as plt
import xmltodict
import pandas
import numpy


class StringConverter(dict):
    def __contains__(self, item):
        return True
    def __getitem__(self, item):
        return str
    def get(self, default=None):
        return str

shared_dom_col= (0,0.608,0.62,0.8)
shared_dom_col_pos= (0,0.608,0.2,0.8)
shared_dom_col_neg = (0.62,0.1,0.1,0.8)
other_dom_col= (0.6,0.6,0.6,0.8)

    
def visualize(pos_set, neg_set,drug,shared_domain,issingdom,fig_name):
    pos_set  = list (pos_set)
    neg_set = list(neg_set)
    height = len(pos_set)+len(neg_set)
    fig = plt.figure(figsize=[8, height])
    ax = fig.add_subplot(111)
    max_seq_len = 0
    top_to_bottom_idx = 0
    y_labels = []
    y_ticks = []
    
    for i in range (len(pos_set)):
        vert_idx = height-top_to_bottom_idx
        p = pos_set[i]
        y_labels += [p]
        y_ticks += [vert_idx]       
        p_path = domain_data_dir+p+".xml"
        with open(p_path) as pf:
            p_dict = xmltodict.parse(pf.read())
        p_seq  = p_dict["pfam"]["entry"]["sequence"]["#text"]
        p_len  = len (p_seq)
        max_seq_len = max(max_seq_len, p_len)
        ax.hlines(vert_idx, 0, p_len, linewidth=2, color="grey")
        if (issingdom[p]):
            ax.scatter([0,p_len],[vert_idx,vert_idx],s=100,c = "k")
        p_domains = p_dict["pfam"]["entry"]["matches"]["match"]
        if type(p_domains) != list:
            p_domains =[p_domains]
        for dom in p_domains:
            acc = dom["@accession"]
            # type = dom["@type"]
            begin = int(dom["location"]["@start"])
            end = int(dom["location"]["@end"])
            #if pfam_A
            if acc==shared_domain:
                col = shared_dom_col_pos
            else:
                col = other_dom_col
            ax.hlines(vert_idx, begin, end, linewidth=10, color=col)
        top_to_bottom_idx +=1
        
        
    for i in range (len(neg_set)):
        vert_idx = height-top_to_bottom_idx
        q = neg_set[i]
        y_labels += [q]
        y_ticks += [vert_idx]       
        q_path = domain_data_dir + q + ".xml"
        with open(q_path) as qf:
            q_dict = xmltodict.parse(qf.read())
        q_seq = q_dict["pfam"]["entry"]["sequence"]["#text"]
        q_len = len(q_seq)
        max_seq_len = max(max_seq_len,q_len)
        ax.hlines(vert_idx, 0, q_len, linewidth=2, color="grey")
        q_domains = q_dict["pfam"]["entry"]["matches"]["match"]
        if (issingdom[q]):
            ax.scatter([0,q_len],[vert_idx,vert_idx],s=100,c = "k")
        p_domains = p_dict["pfam"]["entry"]["matches"]["match"]
        if type(q_domains) != list:
            q_domains =[q_domains]
        for dom in q_domains:
            acc = dom["@accession"]
            # type = dom["@type"]
            begin = int(dom["location"]["@start"])
            end = int(dom["location"]["@end"])
            # if pfam_A
            if acc == shared_domain:
                col = shared_dom_col_neg
            else:
                col = other_dom_col
            ax.hlines(vert_idx, begin, end, linewidth=10, color=col)
        top_to_bottom_idx +=1
        
    h_rng = float(max_seq_len)
    h_margin = h_rng/10
    ax.set_xlim(-h_margin, h_rng+h_margin)
    ax.set_ylim(0.5, height+0.5)
    ax.set_yticks(y_ticks)
    ax.set_yticklabels(y_labels)
    fig.savefig(fig_name)
    plt.close(fig)
    


useful_cols_bindingdb = ["Ki (nM)","Kd (nM)","IC50 (nM)","EC50 (nM)","kon (M-1-s-1)","koff (s-1)","pH","Temp (C)"]
useful_cols_uniprot = ["Entry",	"Status","Proteomes","Entry name","Organism","Fragment","Length"]
mdfile = "# Potential examples for problem 1:\n"


prot_info_df = pandas.read_csv("temp/unique_uniprot_info.tab",sep= "\t")
prot_row_idx   = dict()
for index, row in prot_info_df.iterrows():
    pid = row["yourlist:M202103015C475328CEF75220C360D524E9D456CE1638CDK"]
    prot_row_idx[pid] = index
    

if not os.path.exists("outputs/problem_1_examples"):
    os.mkdir("outputs/problem_1_examples")    
for (dom,dd) in problem1_drug_domain_pairs:
    mdfile += "## domain_drug: "+dom+"_"+dd+"\n\n"
    mdfile += "### Positives\n\n"
    row_idxs = [prot_row_idx[p] for p in Q_pos[dom,dd]]
    pos_rows = prot_info_df.loc[row_idxs , useful_cols_uniprot]
    mdfile += pos_rows.to_markdown(index = False)+"\n\n"
    mdfile += "### Negatives\n\n"
    row_idxs = [prot_row_idx[p] for p in Q_neg[dom,dd]]
    neg_rows = prot_info_df.loc[row_idxs , useful_cols_uniprot]
    mdfile += neg_rows.to_markdown(index = False)+"\n\n"
    fig_name ="outputs/problem_1_examples/"+dom+"_"+dd+".svg"
    visualize(Q_pos[(dom,dd)], Q_neg[(dom,dd)],dd,dom,issingledom,fig_name)
    mdfile += "![]("+dom+"_"+dd+".svg)\n\n"



with open("outputs/problem_1_examples/doc.md", "w") as outf:
    outf.writelines(mdfile)


In [41]:
mdfile = "# Potential examples for problem 1:\n"

if not os.path.exists("outputs/problem_1_examples_rc2"):
    os.mkdir("outputs/problem_1_examples_rc2")    
for (dom,dd) in problem1_drug_domain_pairs_rc2:
    mdfile += "## domain_drug: "+dom+"_"+dd+"\n\n"
    mdfile += "### Positives\n\n"
    row_idxs = [prot_row_idx[p] for p in Q_pos[dom,dd]]
    pos_rows = prot_info_df.loc[row_idxs , useful_cols_uniprot]
    mdfile += pos_rows.to_markdown(index = False)+"\n\n"
    mdfile += "### Negatives\n\n"
    row_idxs = [prot_row_idx[p] for p in Q_neg[dom,dd]]
    neg_rows = prot_info_df.loc[row_idxs , useful_cols_uniprot]
    mdfile += neg_rows.to_markdown(index = False)+"\n\n"
    fig_name ="outputs/problem_1_examples_rc2/"+dom+"_"+dd+".svg"
    visualize(Q_pos[(dom,dd)], Q_neg[(dom,dd)],dd,dom,issingledom,fig_name)
    mdfile += "![]("+dom+"_"+dd+".svg)\n\n"



with open("outputs/problem_1_examples_rc2/doc.md", "w") as outf:
    outf.writelines(mdfile)

## Building dataset2
in dataset1, we had positive pair (p,d), negative pair (q,d), where p is single domain and q shares a domain with p. In dataset2, we want to extend this dataset to include all negative (q,d)s and all negative (p,e). which means negatives that have a drug or protein in common with a postive pair in dataset1

In [20]:
dataset2_proteinextended_negatives  = set()
dataset2_drugextended_negatives = set()
dataset2_proteinextended_positives  = set()
dataset2_drugextended_positives = set()

for (pp,dd) in interact:
    inter = interact[(pp,dd)]
    if pp in dataset1_protein_positive_set:
        if not inter:
            dataset2_proteinextended_negatives.add((pp,dd))          
    if pp in dataset1_protein_negative_set:
        if inter:
            dataset2_proteinextended_positives.add((pp,dd)) 
    if dd in dataset1_drug_positive_set:
        if not inter:
            dataset2_drugextended_negatives.add((pp,dd))   
    if dd in dataset1_drug_negative_set:
        if inter:
            dataset2_drugextended_positives.add((pp,dd))  
            


In [21]:
len(dataset2_drugextended_negatives)

436

In [22]:
len(dataset2_proteinextended_negatives)

5246

len(dataset2_drugextended_positives)

In [24]:
len(dataset2_proteinextended_positives)

29312

In [26]:
dataset2_negatives  = dataset2_drugextended_negatives.union(dataset2_proteinextended_negatives).difference(dataset1_drug_prtein_pairs)

In [27]:
len(dataset2_negatives)

4896

In [28]:
dataset2_positives  = dataset2_drugextended_positives.union(dataset2_proteinextended_positives).difference(dataset1_drug_prtein_pairs)

In [29]:
len(dataset2_positives)

13536

## Redundancy erduced cases
in this case, if we observe that some of the proteins in positive and in negatives have same layout of domains non-domain regions, we remove those from the negatives that ar

In [3]:
negative_all  = [(a,b) for (a,b) in interact if not interact[(a,b)]]

In [4]:
len(negative_all)

14015