# Scheduled Integration of ClinGen Gene-Disease Validity Data into WikiData

ClinGen (Clinical Genome Resource) develops curated data of genetic associations <br>
CC0 https://clinicalgenome.org/docs/terms-of-use/

This scheduled bot operates weekly through WDI to integrate ClinGen Gene-Disease Validity Data <br>
https://search.clinicalgenome.org/kb/gene-validity/ <br>
https://github.com/SuLab/GeneWikiCentral/issues/116 <br>
http://jenkins.sulab.org/ <br>

Python script contributions, in order: Sabah Ul-Hasan, Andra Waagmeester, Andrew Su, Ginger Tsueng

## Checks and Tests

- Login automatically aligns with given environment 
- For loop checks for both HGNC Qid and MONDO Qid per each row (ie if HGNC absent or multiple, then checks MONDO) 
- For loop works on multiple Qid option, tested using A2ML1 as pseudo example
- For loop puts correct Qid for either HGNC or MONDO, if available 
- Error (Status column of output): Classification!=Definitive, HGNC Qid absent or > 1, MONDO Qid absent or > 1
- Complete (Status column of output): Classification=Definitive, HGNC Qid = 1, MONDO Qid = 1 
**Only writes in output if written to Wikidata**
- Updated (Status column of output): Complete for when reference is > 180 days, or if not already written </br>
**For loop skips row if written within 180 days**

In [1]:
# Relevant Modules and Libraries

## Installations by shell 
!pip install --upgrade pip # Installs pip, ensures it's up-to-date
!pip3 install tqdm # Visualizes installation progress (progress bar)
!pip3 install wikidataintegrator # For Wikidata

## Installations by python
from wikidataintegrator import wdi_core, wdi_login # Core and login from wikidataintegrator module
from wikidataintegrator.ref_handlers import update_retrieved_if_new_multiple_refs # For retrieving references
from datetime import datetime # For identifying the current date and time

import copy # Copies references needed in the .csv for uploading to wikidata
import time # For keeping track of total for loop run time

import os # OS package to ensure interaction between the modules (ie WDI) and current OS being used

import pandas as pd # Pandas for data organization, then abbreviated to pd
import numpy as np # Another general purpose package

# import ssl
# ssl._create_default_https_context = ssl._create_unverified_context

Requirement already up-to-date: pip in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (20.0.2)


In [2]:
# Login for running WDI

print("Logging in...") 

## **remove lines when scheduling to Jenkins** Enter your own username and password 
os.environ["WDUSER"] = "Sulhasan" # Uses os package to call and set the environment for wikidata username
os.environ["WDPASS"] = "Sprecial3#"

## Conditional that outputs error command if not in the local python environment
if "WDUSER" in os.environ and "WDPASS" in os.environ: 
    WDUSER = os.environ['WDUSER']
    WDPASS = os.environ['WDPASS']
else: 
    raise ValueError("WDUSER and WDPASS must be specified in local.py or as environment variables")      

## Sets attributed username and password as 'login'
login = wdi_login.WDLogin(WDUSER, WDPASS) 

Logging in...
https://www.wikidata.org/w/api.php
Successfully logged in as Sulhasan


In [16]:
## Read as csv
df = pd.read_csv('https://search.clinicalgenome.org/kb/gene-validity.csv', skiprows=6, header=None)  

# save manually as testdf.csv with 3 genes
### HADHA = none
### MEGF10 = written
### KCNT1 = new

In [19]:
# ClinGen gene-disease validity data

## Read as csv
df = pd.read_csv('/Users/sulhasan/Desktop/Su-Lab/ClinGen-Bot_GeneWikiCentral-Issue116/1-SuLabREPO_Jan-Feb2020_clingen-bot_Revisions/testdf.csv') # test


## Label column headings
df.columns = ['Gene', 'HGNC Gene ID', 'Disease', 'MONDO Disease ID', 'Inheritance', 'SOP','Classification','Report Reference URL','Report Date']


## Create time stamp of when downloaded (error if isoformat() used)
timeStringNow = datetime.now().strftime("+%Y-%m-%dT00:00:00Z")

## Create empty columns for output file (ignore warnings)
df['Status'] = "pending" # "Status" column with 'pending' for all cells:
### 'none' (no write, due to some criteria)
### 'written' (previously written or entered)
### 'updated' (write updated if > 180 days)
### 'new' (new write)
df['Definitive'] = "" # Empty cell to be replaced with 'yes' or 'no' string
df['Gene QID'] = "" # To be replaced with 'check' or 'multiple'
df['Disease QID'] = "" # To be replaced with 'check' or 'multiple'
### 'check' meaning there may be a Qid, but Identifier (HGNC or MONDO) is missing for that Item Page 

df.head()

Unnamed: 0,Gene,HGNC Gene ID,Disease,MONDO Disease ID,Inheritance,SOP,Classification,Report Reference URL,Report Date,Status,Definitive,Gene QID,Disease QID
0,HADHA,HGNC:4801,long chain 3-hydroxyacyl-CoA dehydrogenase def...,MONDO_0012173,Autosomal Recessive,SOP6,Definitive,https://search.clinicalgenome.org/kb/gene-vali...,2018-02-12T17:00:00.000Z,pending,,,
1,MEGF10,HGNC:29634,early-onset myopathy-areflexia-respiratory dis...,MONDO_0013731,Autosomal Recessive,SOP7,Definitive,https://search.clinicalgenome.org/kb/gene-vali...,2020-01-27T17:00:00.000Z,pending,,,
2,KCNT1,HGNC:18865,childhood-onset epilepsy syndrome,MONDO_0020072,Other,SOP4,Definitive,https://search.clinicalgenome.org/kb/gene-vali...,2017-10-20T00:00:00,pending,,,


In [13]:
# Create a function for adding references to then be iterated in the loop "create_reference()"

def create_reference(): # Indicates a parameter included before running rest of function (otherwise may not recognize)
        refStatedIn = wdi_core.WDItemID(value="Q64403342", prop_nr="P248", is_reference=True) # ClinGen Qid = Q64403342, 'stated in' Pid = P248 
        timeStringNow = datetime.now().strftime("+%Y-%m-%dT00:00:00Z") # Create time stamp of when downloaded (error if isoformat() used)
        refRetrieved = wdi_core.WDTime(timeStringNow, prop_nr="P813", is_reference=True) # Calls on previous 'timeStringNow' string, 'retrieved' Pid = P813
        refURL = wdi_core.WDUrl((df.loc[index, 'Report Reference URL']), prop_nr="P854", is_reference=True) # 'reference URL' Pid = P854
        return [refStatedIn, refRetrieved, refURL]

In [24]:
start_time = time.time() # Keep track of how long it takes loop to run

# For loop executing the following through each row of the dataframe 
for index, row in df.iterrows(): 
        
    # Assign the relevant ID number to of a respective gene (HGNC) or disease (MonDO) 
    HGNC = row['HGNC Gene ID'].replace("HGNC:", "") # .replace() edits HGNC: to space for SparQL query
    MONDO = row['MONDO Disease ID'].replace("_", ":")
    
    # SparQL query to search for Gene or Disease in Wikidata based on HGNC ID (P354) or MonDO ID (P5270)
    sparqlQuery_HGNC = "SELECT * WHERE {?gene wdt:P354 \""+HGNC+"\"}" 
    result_HGNC = wdi_core.WDItemEngine.execute_sparql_query(sparqlQuery_HGNC) # Resultant query
    sparqlQuery_MONDO = "SELECT * WHERE {?disease wdt:P5270 \""+MONDO+"\"}" 
    result_MONDO = wdi_core.WDItemEngine.execute_sparql_query(sparqlQuery_MONDO)
    
    # Assign resultant query as the length of its dictionary for either Gene or Disease (number of Qid)
    HGNC_qlength = len(result_HGNC["results"]["bindings"]) 
    MONDO_qlength = len(result_MONDO["results"]["bindings"])
    
    # Conditional utilizing length value for output table, accounts for present/absent combos
    if HGNC_qlength == 1:
        HGNC_qid = result_HGNC["results"]["bindings"][0]["gene"]["value"].replace("http://www.wikidata.org/entity/", "")
        df.at[index, 'Gene QID'] = HGNC_qid # Input HGNC Qid in 'Gene QID' cell  
    if HGNC_qlength < 1: # If no Qid
        df.at[index, 'Status'] = "none" 
        df.at[index, 'Gene QID'] = "check" # It could be the Qid is absent, or the Identifier on page is absent  
    if HGNC_qlength > 1: # If multiple Qid
        df.at[index, 'Status'] = "none" 
        df.at[index, 'Gene QID'] = "multiple"
        
    if MONDO_qlength == 1:
        MONDO_qid = result_MONDO["results"]["bindings"][0]["disease"]["value"].replace("http://www.wikidata.org/entity/", "") 
        df.at[index, 'Disease QID'] = MONDO_qid  
    if MONDO_qlength < 1: 
        df.at[index, 'Status'] = "none" 
        df.at[index, 'Disease QID'] = "check" 
    if MONDO_qlength > 1:
        df.at[index, 'Status'] = "none" 
        df.at[index, 'Disease QID'] = "multiple" 
        
    # Conditional inputs 'none' where Classification != 'Definitive'
    ## Criterion
    if row['Classification']!='Definitive': # If the string is NOT 'Definitive' for the Classification column
        df.at[index, 'Status'] = "none" # Then input "none" in the Status column
        df.at[index, 'Definitive'] = "no" # And'no' for Definitive column
        continue # Skips rest and goes to next row
    else: # Otherwise
        df.at[index, 'Definitive'] = "yes" # Input 'yes' for Definitive column, go to next step
  
    # Conditional continues to write into WikiData only if 1 Qid for each + 'Definitive' Classification 
    if HGNC_qlength == 1 & MONDO_qlength == 1:
        
        # Call upon create_reference() function created   
        reference = create_reference() 
        
        # Add disease value to gene item page, and gene value to disease Item Page (symmetry)
        ## Creates 'gene assocation' statement (P2293), or adds to current one if already there, and includes the references
        statement_HGNC = [wdi_core.WDItemID(value=MONDO_qid, prop_nr="P2293", references=[copy.deepcopy(reference)])] 
        wikidata_HGNCitem = wdi_core.WDItemEngine(wd_item_id=HGNC_qid, 
                                                  data=statement_HGNC, 
                                                  global_ref_mode='CUSTOM', # parameter that looks within 180 days
                                                  ref_handler=update_retrieved_if_new_multiple_refs, 
                                                  append_value=["P2293"])  
        wikidata_HGNCitem.get_wd_json_representation()
        #wikidata_HGNCitem.write(login)
        test = wikidata_HGNCitem.lastrevid
        
        df.at[index, 'Status'] = "new" 
        # If already written then
            ## written
        # If new write then
            ## new
        # else 
            ## updated
            
end_time = time.time() # Captures when loop run ends
print("The total time of this loop is:", end_time - start_time, "seconds, or", (end_time - start_time)/60, "minutes")

# Write output to a .csv file
now = datetime.now() # Retrieves current time and saves it as 'now'
# Includes hour:minute:second_dd-mm-yyyy time stamp (https://en.wikipedia.org/wiki/ISO_8601)
# df.to_csv("ClinGenBot_Status-Output_" + now.isoformat() + ".csv")  # isoformat
df

None
None
The total time of this loop is: 2.1804747581481934 seconds, or 0.03634124596913656 minutes


Unnamed: 0,Gene,HGNC Gene ID,Disease,MONDO Disease ID,Inheritance,SOP,Classification,Report Reference URL,Report Date,Status,Definitive,Gene QID,Disease QID
0,HADHA,HGNC:4801,long chain 3-hydroxyacyl-CoA dehydrogenase def...,MONDO_0012173,Autosomal Recessive,SOP6,Definitive,https://search.clinicalgenome.org/kb/gene-vali...,2018-02-12T17:00:00.000Z,none,yes,Q1145906,check
1,MEGF10,HGNC:29634,early-onset myopathy-areflexia-respiratory dis...,MONDO_0013731,Autosomal Recessive,SOP7,Definitive,https://search.clinicalgenome.org/kb/gene-vali...,2020-01-27T17:00:00.000Z,new,yes,Q18047620,Q56002943
2,KCNT1,HGNC:18865,childhood-onset epilepsy syndrome,MONDO_0020072,Other,SOP4,Definitive,https://search.clinicalgenome.org/kb/gene-vali...,2017-10-20T00:00:00,new,yes,Q18043170,Q5382985


In [31]:
start_time = time.time() # Keep track of how long it takes loop to run

# For loop executing the following through each row of the dataframe 
for index, row in df.iterrows(): 
        
    # Assign the relevant ID number to of a respective gene (HGNC) or disease (MONDO) 
    HGNC = row['HGNC Gene ID'].replace("HGNC:", "") # .replace() edits HGNC: to space for SparQL query
    MONDO = row['MONDO Disease ID'].replace("_", ":")
    
    # SparQL query to search for Gene or Disease in Wikidata based on HGNC ID (P354) or MonDO ID (P5270)
    sparqlQuery_HGNC = "SELECT * WHERE {?gene wdt:P354 \""+HGNC+"\"}" 
    result_HGNC = wdi_core.WDItemEngine.execute_sparql_query(sparqlQuery_HGNC) # Resultant query
    sparqlQuery_MONDO = "SELECT * WHERE {?disease wdt:P5270 \""+MONDO+"\"}" 
    result_MONDO = wdi_core.WDItemEngine.execute_sparql_query(sparqlQuery_MONDO)
    
    # Assign resultant length of dictionary for either Gene or Disease (number of Qid)
    HGNC_qlength = len(result_HGNC["results"]["bindings"]) 
    MONDO_qlength = len(result_MONDO["results"]["bindings"])
    
    # Conditional utilizing length value for output table, accounts for absent/present combos
    if HGNC_qlength == 1:
        HGNC_qid = result_HGNC["results"]["bindings"][0]["gene"]["value"].replace("http://www.wikidata.org/entity/", "")
        df.at[index, 'Gene QID'] = HGNC_qid # Input HGNC Qid in 'Gene QID' cell  
    if HGNC_qlength < 1: # If no Qid
        df.at[index, 'Status'] = "error" 
        df.at[index, 'Gene QID'] = "absent"  
    if HGNC_qlength > 1: # If multiple Qid
        df.at[index, 'Status'] = "error" 
        df.at[index, 'Gene QID'] = "multiple"
        
    if MONDO_qlength == 1:
        MONDO_qid = result_MONDO["results"]["bindings"][0]["disease"]["value"].replace("http://www.wikidata.org/entity/", "") 
        df.at[index, 'Disease QID'] = MONDO_qid  
    if MONDO_qlength < 1: 
        df.at[index, 'Status'] = "error" 
        df.at[index, 'Disease QID'] = "absent" 
    if MONDO_qlength > 1:
        df.at[index, 'Status'] = "error" 
        df.at[index, 'Disease QID'] = "multiple" 
        
    # Conditional inputs error such that only rows are written for where Classification = 'Definitive'
    if row['Classification']!='Definitive': # If the string is NOT 'Definitive' for the Classification column
        df.at[index, 'Status'] = "error: Classification not Definitive" # Then input "error" in the Status column
        df.at[index, 'Definitive'] = "no" # And'no' for Definitive column
        continue # Skips rest and goes to next row
    else: # Otherwise
        df.at[index, 'Definitive'] = "yes" # Input 'yes' for Definitive column, go to next step
  
    # Conditional continues to write into WikiData only if 1 Qid for each + Definitive classification 
    if HGNC_qlength == 1 & MONDO_qlength == 1:
        
        # Call upon create_reference() function created   
        reference = create_reference() 
        
        # Add disease value to gene item page, and gene value to disease item page (symmetry)
        
        # Creates 'gene assocation' statement (P2293) whether or not it's already there, and includes the references
        statement_HGNC = [wdi_core.WDItemID(value=MONDO_qid, prop_nr="P2293", references=[copy.deepcopy(reference)])] 
        wikidata_HGNCitem = wdi_core.WDItemEngine(wd_item_id=HGNC_qid, 
                                                  data=statement_HGNC, 
                                                  global_ref_mode='CUSTOM', # parameter that looks within 180 days
                                                  ref_handler=update_retrieved_if_new_multiple_refs, 
                                                  append_value=["P2293"])  
        wikidata_HGNCitem.get_wd_json_representation()
        #wikidata_HGNCitem.write(login)
        print(wikidata_HGNCitem.lastrevid)

        statement_MONDO = [wdi_core.WDItemID(value=HGNC_qid, prop_nr="P2293", references=[copy.deepcopy(reference)])] 
        wikidata_MONDOitem = wdi_core.WDItemEngine(wd_item_id=MONDO_qid, 
                                                   data=statement_MONDO, 
                                                   global_ref_mode='CUSTOM',
                                                   ref_handler=update_retrieved_if_new_multiple_refs, 
                                                   append_value=["P2293"])
        wikidata_MONDOitem.get_wd_json_representation()
        #wikidata_MONDOitem.write(login)
        
        #df.at[index, 'Status'] = "complete" 
        
end_time = time.time() # Captures when loop run ends
print("The total time of this loop is:", end_time - start_time, "seconds, or", (end_time - start_time)/60, "minutes")

# Write output to a .csv file
now = datetime.now() # Retrieves current time and saves it as 'now'
# Includes hour:minute:second_dd-mm-yyyy time stamp (https://en.wikipedia.org/wiki/ISO_8601)
# df.to_csv("ClinGenBot_Status-Output_" + now.isoformat() + ".csv")  # isoformat

None
None
None
None
None
None
None
None
The total time of this loop is: 8.593843936920166 seconds, or 0.14323073228200275 minutes
