# Integration of ClinGen Gene-Disease Validity Data into Wikidata

ClinGen (Clinical Genome Resource) develops curated data on genetic associations. <br>
License: CC0 https://clinicalgenome.org/docs/terms-of-use/

This scheduled bot uses WikidataIntegrator (WDI) to integrate ClinGen Gene-Disease Validity data into Wikidata <br>
https://github.com/SuLab/GeneWikiCentral/issues/116 <br>
https://search.clinicalgenome.org/kb/gene-validity/ <br>

Python script contributions, in order: Sabah Ul-Hasan, Andra Waagmeester, Andrew Su

## Checks

- **Unsure whether the login automatically picks up credentials from the given environment
- create_reference() adds references to an existing HGNC or MONDO value in a genetic association statement (it doesn't overwrite URLs from non-ClinGen sources)
- The for loop checks for both the HGNC Qid and the MONDO Qid in each row (i.e. if HGNC is absent or multiple, it still checks MONDO)
- **Is there a better way than the current nested approach? ^
- The for loop records the correct Qid for either HGNC or MONDO, if available
- **No Qid is logged for either if Classification != Definitive (confirming we are fine with this; I say yes)
- **The for loop should work with the multiple-Qid option; tested by switching for elif, but the A2ML1 attempt doesn't work https://www.wikidata.org/wiki/Q18051234
- **Playing around with 'update_retrieved_if_new_multiple_refs' (see for loop); unsure if anything is happening, and need to adjust to 365 days instead of 180 <br>
Example: https://github.com/SuLab/scheduled-bots/blob/6bf99c5287280ee473236f95ff412bd5eb75b814/scheduled_bots/civic/bot.py <br>
More info: https://github.com/SuLab/WikidataIntegrator/tree/master/wikidataintegrator/ref_handlers
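
One possible answer to the nested-approach question above is to resolve each identifier with a small helper and decide the row status once, after both lookups. The sketch below is an assumption, not the bot's current code: `resolve_qid` and `row_outcome` are hypothetical names, and the only thing taken from the notebook is the SPARQL JSON shape the loop already reads (`results`, then `bindings`, then the variable name, then `value`).

```python
ENTITY_PREFIX = "http://www.wikidata.org/entity/"

def resolve_qid(result, var):
    """Return the Qid if exactly one binding exists, else 'absent' or 'multiple'."""
    bindings = result["results"]["bindings"]
    if len(bindings) == 1:
        return bindings[0][var]["value"].replace(ENTITY_PREFIX, "")
    return "absent" if not bindings else "multiple"

def row_outcome(result_gene, result_disease):
    """Resolve both sides independently, then derive a single status for the row."""
    gene_qid = resolve_qid(result_gene, "gene")
    disease_qid = resolve_qid(result_disease, "disease")
    usable = all(q not in ("absent", "multiple") for q in (gene_qid, disease_qid))
    return gene_qid, disease_qid, ("pending" if usable else "error")

# Hand-built results in the same shape as a SPARQL JSON response (hypothetical data)
one_gene = {"results": {"bindings": [{"gene": {"value": ENTITY_PREFIX + "Q17709258"}}]}}
no_disease = {"results": {"bindings": []}}
print(row_outcome(one_gene, no_disease))
```

Because both lookups are always resolved before the status is set, the MONDO branch does not need to be repeated under each HGNC branch.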

## To Do

1) Update across entire dataframe <br>
2) Share full output file with ClinGen <br>
3) Set up scheduled bot through proteinboxbot (update login) <br>
4) Run in jenkins: http://jenkins.sulab.org/

In [76]:
### Relevant modules and libraries

# Installations by shell 
!pip install --upgrade pip # Upgrades pip to the latest version
!pip3 install tqdm # Progress-bar library
!pip3 install termcolor # For color-coding printed output
!pip3 install wikidataintegrator # For reading from and writing to Wikidata

# Imports
from wikidataintegrator import wdi_core, wdi_login # Core and login modules from wikidataintegrator
from wikidataintegrator.ref_handlers import update_retrieved_if_new_multiple_refs # Reference handler used when writing statements
import copy # Deep-copies the reference list so each statement gets its own copy
from datetime import datetime # For identifying the current date and time

import os # For reading and setting environment variables (Wikidata credentials)

import pandas as pd # Pandas for data organization, abbreviated to pd
import numpy as np # General-purpose numerical package
from termcolor import colored # Colors printed output

Requirement already up-to-date: pip in /srv/paws/lib/python3.6/site-packages (19.3.1)


In [69]:
### ClinGen gene-disease validity data

# Read as csv
df = pd.read_csv('https://search.clinicalgenome.org/kb/gene-validity.csv', skiprows=6, header=None)  

# Label column headings
df.columns = ['Gene', 'HGNC Gene ID', 'Disease', 'MONDO Disease ID','SOP','Classification','Report Reference URL','Report Date']

# Create time stamp of when downloaded (error if isoformat() used)
timeStringNow = datetime.now().strftime("+%Y-%m-%dT00:00:00Z")

df.head(6) # View first 6 rows

Unnamed: 0,Gene,HGNC Gene ID,Disease,MONDO Disease ID,SOP,Classification,Report Reference URL,Report Date
0,A2ML1,HGNC:23336,Noonan syndrome with multiple lentigines,MONDO_0007893,SOP5,No Reported Evidence,https://search.clinicalgenome.org/kb/gene-vali...,2018-06-07T14:37:47.175Z
1,A2ML1,HGNC:23336,cardiofaciocutaneous syndrome,MONDO_0015280,SOP5,No Reported Evidence,https://search.clinicalgenome.org/kb/gene-vali...,2018-06-07T14:31:03.696Z
2,A2ML1,HGNC:23336,Costello syndrome,MONDO_0009026,SOP5,No Reported Evidence,https://search.clinicalgenome.org/kb/gene-vali...,2018-06-07T14:34:05.324Z
3,A2ML1,HGNC:23336,Noonan syndrome,MONDO_0018997,SOP5,Disputed,https://search.clinicalgenome.org/kb/gene-vali...,2018-06-07T14:23:53.157Z
4,A2ML1,HGNC:23336,Noonan syndrome-like disorder with loose anage...,MONDO_0011899,SOP5,No Reported Evidence,https://search.clinicalgenome.org/kb/gene-vali...,2018-06-07T14:40:11.599Z
5,AARS,HGNC:20,undetermined early-onset epileptic encephalopathy,MONDO_0018614,SOP6,Limited,https://search.clinicalgenome.org/kb/gene-vali...,2018-11-20T17:00:00.000Z
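
The download timestamp above is built with `strftime` rather than `isoformat()`. A minimal sketch of why, assuming the reference date has to match the `'+%Y-%m-%dT%H:%M:%SZ'` pattern that the ref handler later in this notebook parses; `day_timestamp` is a hypothetical helper name and the date is made up for illustration.

```python
from datetime import datetime

DATEFMT = "+%Y-%m-%dT%H:%M:%SZ"  # pattern parsed by the ref handler in this notebook

def day_timestamp(moment):
    """Format a datetime at day precision, as the download cell does."""
    return moment.strftime("+%Y-%m-%dT00:00:00Z")

fixed = datetime(2019, 11, 5, 14, 30, 12)  # illustrative date only
stamp = day_timestamp(fixed)
print(stamp)
print(datetime.strptime(stamp, DATEFMT))  # parses back cleanly

try:
    datetime.strptime(fixed.isoformat(), DATEFMT)
except ValueError:
    print("isoformat() output does not match the expected pattern")
```

The `isoformat()` string has no leading `+` and no trailing `Z`, which is consistent with the error noted in the cell comment.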


### Update to entire dataframe (all subsetdf to df throughout) after bot is approved as satisfactory

In [70]:
subsetdf = df[15:20] # Subset and rename as subsetdf
subsetdf

Unnamed: 0,Gene,HGNC Gene ID,Disease,MONDO Disease ID,SOP,Classification,Report Reference URL,Report Date
15,ACADVL,HGNC:92,very long chain acyl-CoA dehydrogenase deficiency,MONDO_0008723,SOP6,Definitive,https://search.clinicalgenome.org/kb/gene-vali...,2018-02-20T17:00:00.000Z
16,ACAT1,HGNC:93,beta-ketothiolase deficiency,MONDO_0008760,SOP6,Definitive,https://search.clinicalgenome.org/kb/gene-vali...,2018-05-22T16:00:00.000Z
17,ACSL4,HGNC:3571,non-syndromic X-linked intellectual disability,MONDO_0019181,SOP4,Moderate,https://search.clinicalgenome.org/kb/gene-vali...,2017-10-20T00:00:00
18,ACTA1,HGNC:129,hypertrophic cardiomyopathy,MONDO_0005045,SOP4,No Reported Evidence,https://search.clinicalgenome.org/kb/gene-vali...,false
19,ACTA2,HGNC:130,familial thoracic aortic aneurysm and aortic d...,MONDO_0019625,SOP4,Definitive,https://search.clinicalgenome.org/kb/gene-vali...,2016-09-27T00:00:00


In [42]:
### Login for running WDI

print("Logging in...") 

# Enter your own username and password ** to be updated to ProteinBoxBot
os.environ["WDUSER"] = "username" # Sets the Wikidata username as an environment variable
os.environ["WDPASS"] = "password" # Sets the Wikidata password as an environment variable

# Raise an error if the credentials are not available as environment variables
# (always passes here, because both variables are set just above)
if "WDUSER" in os.environ and "WDPASS" in os.environ: 
    WDUSER = os.environ['WDUSER']
    WDPASS = os.environ['WDPASS']
else: 
    raise ValueError("WDUSER and WDPASS must be specified in local.py or as environment variables")      

# Log in and store the session as 'login'
login = wdi_login.WDLogin(WDUSER, WDPASS) 

Logging in...
https://www.wikidata.org/w/api.php
Successfully logged in as Sulhasan
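
For the scheduled run, the credentials would more likely come from the environment than from values typed into the notebook. A minimal sketch of that check, assuming the same `WDUSER` and `WDPASS` variable names; `read_credentials` is a hypothetical helper, and plain dicts stand in for `os.environ` so the example runs without real credentials.

```python
import os

def read_credentials(env=os.environ):
    """Return (user, password) from WDUSER/WDPASS, or raise if either is missing."""
    missing = [name for name in ("WDUSER", "WDPASS") if not env.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return env["WDUSER"], env["WDPASS"]

# Plain dicts stand in for os.environ (placeholder values, not real accounts)
print(read_credentials({"WDUSER": "ExampleBot", "WDPASS": "secret"}))
try:
    read_credentials({"WDUSER": "ExampleBot"})
except ValueError as err:
    print(err)
```

With this shape the check can actually fail, which the version that sets both variables immediately beforehand cannot.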


### For loop that iterates across dataframe and uploads to WikiData

In [74]:
### Create a function for adding references, to be called inside the loop: "create_reference()"

def create_reference():
    # Relies on the globals 'timeStringNow', 'subsetdf', and 'index' (the current row, set by the for loop)
    refStatedIn = wdi_core.WDItemID(value="Q64403342", prop_nr="P248", is_reference=True) # ClinGen Qid = Q64403342, 'stated in' Pid = P248
    refRetrieved = wdi_core.WDTime(timeStringNow, prop_nr="P813", is_reference=True) # Uses the 'timeStringNow' string created earlier, 'retrieved' Pid = P813
    refURL = wdi_core.WDUrl((subsetdf.loc[index, 'Report Reference URL']), prop_nr="P854", is_reference=True) # 'reference URL' Pid = P854
    return [refStatedIn, refRetrieved, refURL]

In [None]:
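# Same name as the handler imported from WDI, so running this cell overrides the import; default is days=365
def update_retrieved_if_new_multiple_refs(olditem, newitem, days=365, retrieved_pid='P813'):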

    def is_equal_not_retrieved(oldref, newref):

        if len(oldref) != len(newref):
            return False
        oldref_minus_retrieved = [x for x in oldref if x.get_prop_nr() != retrieved_pid]
        newref_minus_retrieved = [x for x in newref if x.get_prop_nr() != retrieved_pid]
        if not all(x in oldref_minus_retrieved for x in newref_minus_retrieved):
            return False
        oldref_retrieved = [x for x in oldref if x.get_prop_nr() == retrieved_pid]
        newref_retrieved = [x for x in newref if x.get_prop_nr() == retrieved_pid]
        if (len(newref_retrieved) != len(oldref_retrieved)):
            return False
        return True

    def ref_overwrite(oldref, newref, days):

        if len(oldref) != len(newref):
            return True
        oldref_minus_retrieved = [x for x in oldref if x.get_prop_nr() != retrieved_pid]
        newref_minus_retrieved = [x for x in newref if x.get_prop_nr() != retrieved_pid]
        if not all(x in oldref_minus_retrieved for x in newref_minus_retrieved):
            return True
        oldref_retrieved = [x for x in oldref if x.get_prop_nr() == retrieved_pid]
        newref_retrieved = [x for x in newref if x.get_prop_nr() == retrieved_pid]
        if (len(newref_retrieved) != len(oldref_retrieved)) or not (
                        len(newref_retrieved) == len(oldref_retrieved) == 1):
            return True
        datefmt = '+%Y-%m-%dT%H:%M:%SZ'
        retold = list([datetime.strptime(r.get_value()[0], datefmt) for r in oldref if r.get_prop_nr() == retrieved_pid])[0]
        retnew = list([datetime.strptime(r.get_value()[0], datefmt) for r in newref if r.get_prop_nr() == retrieved_pid])[0]
        return (retnew - retold).days >= days

    newrefs = newitem.references
    oldrefs = olditem.references

    found_mate = [False] * len(newrefs)
    for new_n, newref in enumerate(newrefs):
        for old_n, oldref in enumerate(oldrefs):
            if is_equal_not_retrieved(oldref, newref):
                found_mate[new_n] = True
                if ref_overwrite(oldref, newref, days):
                    oldrefs[old_n] = newref
    for f_idx, f in enumerate(found_mate):
        if not f:
            oldrefs.append(newrefs[f_idx])
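
The date comparison at the end of `ref_overwrite` is what the 180-versus-365-day question hinges on. A minimal sketch of just that step, using the same date format and `>=` comparison; `is_stale` is a hypothetical name and the dates are made up for illustration.

```python
from datetime import datetime

DATEFMT = "+%Y-%m-%dT%H:%M:%SZ"

def is_stale(old_retrieved, new_retrieved, days=365):
    """True when the new 'retrieved' date is at least `days` after the old one."""
    old = datetime.strptime(old_retrieved, DATEFMT)
    new = datetime.strptime(new_retrieved, DATEFMT)
    return (new - old).days >= days

# Exactly 365 days apart: refreshed with the 365-day threshold
print(is_stale("+2018-11-05T00:00:00Z", "+2019-11-05T00:00:00Z"))
# Exactly 180 days apart: kept at 365 days, refreshed at 180 days
print(is_stale("+2019-05-09T00:00:00Z", "+2019-11-05T00:00:00Z"))
print(is_stale("+2019-05-09T00:00:00Z", "+2019-11-05T00:00:00Z", days=180))
```

A reference retrieved less than `days` ago is left untouched, which may explain why repeated test runs appear to change nothing.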

In [73]:
### Create empty columns for output file (ignore warnings)

subsetdf['Status'] = "pending" # "Status" column starts as 'pending' for all cells and is later set to 'error', 'complete', or 'skipped' (meaning previously logged)
subsetdf['Definitive'] = "" # Empty cell to be replaced with 'yes' or 'no' string
subsetdf['Gene QID'] = "" # To be replaced with the gene Qid, 'absent', or 'multiple'
subsetdf['Disease QID'] = "" # To be replaced with the disease Qid, 'absent', or 'multiple'

subsetdf

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until

Unnamed: 0,Gene,HGNC Gene ID,Disease,MONDO Disease ID,SOP,Classification,Report Reference URL,Report Date,Status,Definitive,Gene QID,Disease QID
15,ACADVL,HGNC:92,very long chain acyl-CoA dehydrogenase deficiency,MONDO_0008723,SOP6,Definitive,https://search.clinicalgenome.org/kb/gene-vali...,2018-02-20T17:00:00.000Z,pending,,,
16,ACAT1,HGNC:93,beta-ketothiolase deficiency,MONDO_0008760,SOP6,Definitive,https://search.clinicalgenome.org/kb/gene-vali...,2018-05-22T16:00:00.000Z,pending,,,
17,ACSL4,HGNC:3571,non-syndromic X-linked intellectual disability,MONDO_0019181,SOP4,Moderate,https://search.clinicalgenome.org/kb/gene-vali...,2017-10-20T00:00:00,pending,,,
18,ACTA1,HGNC:129,hypertrophic cardiomyopathy,MONDO_0005045,SOP4,No Reported Evidence,https://search.clinicalgenome.org/kb/gene-vali...,false,pending,,,
19,ACTA2,HGNC:130,familial thoracic aortic aneurysm and aortic d...,MONDO_0019625,SOP4,Definitive,https://search.clinicalgenome.org/kb/gene-vali...,2016-09-27T00:00:00,pending,,,
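
The warnings above come from adding columns to a slice of `df`. A minimal sketch of one way to avoid them, assuming it is acceptable for the subset to be an independent copy; the small DataFrame here is made-up stand-in data, not the ClinGen download.

```python
import pandas as pd

# Stand-in data with the same column names as the notebook (values are illustrative)
df = pd.DataFrame({"Gene": ["ACADVL", "ACAT1", "ACSL4"],
                   "Classification": ["Definitive", "Definitive", "Moderate"]})

subsetdf = df.iloc[0:2].copy()  # explicit copy, so later writes do not target a slice of df
subsetdf["Status"] = "pending"  # adds the column to the copy only

print(list(subsetdf.columns))
print("Status" in df.columns)   # the original DataFrame is unchanged
```

Once the bot runs on the whole DataFrame instead of a subset, the slice and the copy are no longer needed.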


In [77]:
### For loop that executes the following through each row of the dataframe 

for index, row in subsetdf.iterrows(): # Index is a row number, row is all variables and values for that row
    
    
    # Conditional that iterates only for rows where the Classification is 'Definitive'
    if row['Classification']!='Definitive': # If the string is NOT 'Definitive' for the Classification column
        subsetdf.at[index, 'Status'] = "error" # Then input "error" in the Status column
        subsetdf.at[index, 'Definitive'] = "no" # Then input 'no' for Definitive column
        continue # And skip the rest of the for loop
    else: # Otherwise
        subsetdf.at[index, 'Definitive'] = "yes" # Input 'yes' for Definitive column go on to next step
        
        
    # Identify the string in the Gene or Disease column for a given row
    HGNC = subsetdf.loc[index, 'Gene'] # Gene symbol, matched against HGNC gene symbol (P353)
    MONDO = subsetdf.loc[index, 'MONDO Disease ID'].replace("_", ":") # .replace() changes _ to : for SPARQL query
    
    
    # SPARQL query to search for Gene or Disease in Wikidata based on HGNC gene symbol (P353) or MONDO ID (P5270)
    sparqlQuery_HGNC = "SELECT * WHERE {?gene wdt:P353 \""+HGNC+"\"}" 
    result_HGNC = wdi_core.WDItemEngine.execute_sparql_query(sparqlQuery_HGNC) # Resultant query
    sparqlQuery_MONDO = "SELECT * WHERE {?disease wdt:P5270 \""+MONDO+"\"}" 
    result_MONDO = wdi_core.WDItemEngine.execute_sparql_query(sparqlQuery_MONDO)
    
    
    # Nested conditional that utilizes length function to call upon the result dictionary for either Gene or Disease
    if len(result_HGNC["results"]["bindings"])==1: # We only want one Q# result 
        HGNC_qid = result_HGNC["results"]["bindings"][0]["gene"]["value"].replace("http://www.wikidata.org/entity/", "")
        subsetdf.at[index, 'Gene QID'] = HGNC_qid # Input HGNC Qid in 'Gene QID' cell
        
        # Nest for MONDO to ensure it checks for both at the same time (and doesn't skip) 
        if len(result_MONDO["results"]["bindings"])==1: 
            MONDO_qid = result_MONDO["results"]["bindings"][0]["disease"]["value"].replace("http://www.wikidata.org/entity/", "") 
            subsetdf.at[index, 'Disease QID'] = MONDO_qid 
        elif len(result_MONDO["results"]["bindings"])<1: 
            subsetdf.at[index, 'Status'] = "error" 
            subsetdf.at[index, 'Disease QID'] = "absent" 
            continue
        else:
            subsetdf.at[index, 'Status'] = "error" 
            subsetdf.at[index, 'Disease QID'] = "multiple" 
            continue
            
    elif len(result_HGNC["results"]["bindings"])<1: # If no gene Qid was found
        subsetdf.at[index, 'Status'] = "error" 
        subsetdf.at[index, 'Gene QID'] = "absent" 
        
        if len(result_MONDO["results"]["bindings"])==1: 
            MONDO_qid = result_MONDO["results"]["bindings"][0]["disease"]["value"].replace("http://www.wikidata.org/entity/", "") 
            subsetdf.at[index, 'Disease QID'] = MONDO_qid 
        elif len(result_MONDO["results"]["bindings"])<1: 
            subsetdf.at[index, 'Status'] = "error" 
            subsetdf.at[index, 'Disease QID'] = "absent" 
            continue
        else:
            subsetdf.at[index, 'Status'] = "error" 
            subsetdf.at[index, 'Disease QID'] = "multiple" 
            continue
            
        continue
        
    else: # If more than one gene Qid was found
        subsetdf.at[index, 'Status'] = "error" 
        subsetdf.at[index, 'Gene QID'] = "multiple" 
        
        if len(result_MONDO["results"]["bindings"])==1: 
            MONDO_qid = result_MONDO["results"]["bindings"][0]["disease"]["value"].replace("http://www.wikidata.org/entity/", "") 
            subsetdf.at[index, 'Disease QID'] = MONDO_qid 
        elif len(result_MONDO["results"]["bindings"])<1: 
            subsetdf.at[index, 'Status'] = "error" 
            subsetdf.at[index, 'Disease QID'] = "absent" 
            continue
        else:
            subsetdf.at[index, 'Status'] = "error" 
            subsetdf.at[index, 'Disease QID'] = "multiple" 
            continue
            
        continue
        
    # Build the reference list for this row with create_reference()
    reference = create_reference() 
 
    # Add disease value to gene item page, and gene value to disease item page

    statement_HGNC = [wdi_core.WDItemID(value=MONDO_qid, prop_nr="P2293", references=[copy.deepcopy(reference)])] # Creates 'genetic association' statement (P2293) whether or not it's already there, and includes the references
    wikidata_HGNCitem = wdi_core.WDItemEngine(wd_item_id=HGNC_qid,
                                              data=statement_HGNC, 
                                              ref_handler=update_retrieved_if_new_multiple_refs, # Ref handler: appends new references and refreshes matching ones once their 'retrieved' date is old enough
                                              append_value=["P2293"])
    wikidata_HGNCitem.get_wd_json_representation() # Returns the JSON structure that is submitted to the API, helpful for debugging
    statement_MONDO = [wdi_core.WDItemID(value=HGNC_qid, prop_nr="P2293", references=[copy.deepcopy(reference)])] # Symmetry for disease item page
    wikidata_MONDOitem = wdi_core.WDItemEngine(wd_item_id=MONDO_qid, data=statement_MONDO, append_value=["P2293"]) # Note: no ref_handler is passed for the disease item
    wikidata_MONDOitem.get_wd_json_representation()
    
    subsetdf.at[index, 'Status'] = "complete" 
    print(colored(HGNC,"blue"), "Gene successfully logged as", colored(wikidata_HGNCitem.write(login),"blue"), "and", colored(MONDO,"green"), "Disease successfully logged as", colored(wikidata_MONDOitem.write(login),"green"))

# Write output to a .csv file
now = datetime.now() # Retrieves current time and saves it as 'now'
# File name includes an ISO 8601 time stamp, yyyy-mm-ddThh:mm:ss.ffffff (https://en.wikipedia.org/wiki/ISO_8601)
subsetdf.to_csv("ClinGenBot_Status-Output_" + now.isoformat() + ".csv")  # isoformat
subsetdf

ACADVL Gene successfully logged as Q15996541 and MONDO:0008723 Disease successfully logged as Q7923095
ACAT1 Gene successfully logged as Q14913201 and MONDO:0008760 Disease successfully logged as Q4897218


Unnamed: 0,Gene,HGNC Gene ID,Disease,MONDO Disease ID,SOP,Classification,Report Reference URL,Report Date,Status,Definitive,Gene QID,Disease QID
15,ACADVL,HGNC:92,very long chain acyl-CoA dehydrogenase deficiency,MONDO_0008723,SOP6,Definitive,https://search.clinicalgenome.org/kb/gene-vali...,2018-02-20T17:00:00.000Z,complete,yes,Q15996541,Q7923095
16,ACAT1,HGNC:93,beta-ketothiolase deficiency,MONDO_0008760,SOP6,Definitive,https://search.clinicalgenome.org/kb/gene-vali...,2018-05-22T16:00:00.000Z,complete,yes,Q14913201,Q4897218
17,ACSL4,HGNC:3571,non-syndromic X-linked intellectual disability,MONDO_0019181,SOP4,Moderate,https://search.clinicalgenome.org/kb/gene-vali...,2017-10-20T00:00:00,error,no,,
18,ACTA1,HGNC:129,hypertrophic cardiomyopathy,MONDO_0005045,SOP4,No Reported Evidence,https://search.clinicalgenome.org/kb/gene-vali...,false,error,no,,
19,ACTA2,HGNC:130,familial thoracic aortic aneurysm and aortic d...,MONDO_0019625,SOP4,Definitive,https://search.clinicalgenome.org/kb/gene-vali...,2016-09-27T00:00:00,error,yes,Q17709258,absent
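
The output file name is built from `now.isoformat()`, which contains colons; those are not allowed in Windows file names. If the status file is to be shared, a colon-free time stamp may be safer. A minimal sketch, assuming the same file name prefix; `output_filename` is a hypothetical helper and the date is made up.

```python
from datetime import datetime

def output_filename(moment, prefix="ClinGenBot_Status-Output_"):
    """Build a .csv file name with a colon-free, sortable time stamp."""
    return prefix + moment.strftime("%Y-%m-%dT%H-%M-%S") + ".csv"

print(output_filename(datetime(2019, 11, 5, 14, 30, 12)))
# ClinGenBot_Status-Output_2019-11-05T14-30-12.csv
```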
