This is an update on [01_querying_wikidata_for_hetnet_edges.ipynb](https://github.com/SuLab/WD-rephetio-analysis/blob/v1.0/1_code/01_querying_wikidata_for_hetnet_edges.ipynb) to address the expansion in the Wikidata knowledgebase currently resulting in a timeout during SPARQL query of nodes and edges.

**Input**: Anatomy, Biological Process, Cellular Component, Compounds, Disease, Genes-Pathway-Protein, Phenotype, Protein, Molecular Function
<br>8 categories, each categorized as individual .ndjson files
<br>[Derived from this notebook, an update of MM's metapaths](https://github.com/sabahzero/WRP/blob/main/src/archive/01a_Wikidata-Nodes.ipynb)

**Output**
ID, LABEL, name (Mike's nodes_2019-09-03.csv as an example)

<br>
<br>


### Confirmed Checks
- Items are correctly binned in respective files
<br>*exception* gpp

### Checks to Confirm
- Numbers match up with what Mike had (specifically bp and mf, which should be unchanged) 
- Put a timer on how long each dataset takes to load
- Create a table: Input category, # of item pages, time it takes to load/run, output
- Update nodes, including separating current GPP category
- Cross over of QIDs and duplicates (which should be which)
- How do I pull out edges?

In [1]:
import pandas as pd

import time 
from datetime import datetime

In [9]:
nodes_mike = pd.read_csv('file-examples/nodes_2019-09-03.csv')  
edges_mike = pd.read_csv('file-examples/edges_2019-09-03.csv')  

The total time of this upload is: 0.010258221626281738 minutes


  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


In [24]:
nodes_mike.head()

Unnamed: 0,:ID,:LABEL,name
0,Q4735440,Compound,alpidem
1,Q5983577,Compound,ibacitabine
2,Q221174,Compound,topiramate
3,Q2016367,Compound,tetraphenylborate
4,Q168685,Compound,fentin acetate


In [11]:
edges_mike.head() # qualifier NaN (both head and tail)

Unnamed: 0,:START_ID,:END_ID,:TYPE,qualifier,inner1
0,Q144085,Q412496,SIGNIFICANT_DRUG_INTERACTION_CdiC,,
1,Q3712521,Q419394,SIGNIFICANT_DRUG_INTERACTION_CdiC,,
2,Q410061,Q417597,SIGNIFICANT_DRUG_INTERACTION_CdiC,,
3,Q174259,Q7739,SIGNIFICANT_DRUG_INTERACTION_CdiC,,
4,Q3791612,Q41576,SIGNIFICANT_DRUG_INTERACTION_CdiC,,


In [12]:
edges_mike.tail()

Unnamed: 0,:START_ID,:END_ID,:TYPE,qualifier,inner1
614760,Q21112306,Q14878358,UP_REGULATES_GuBP,,Q14761970
614761,Q21111886,Q14864067,UP_REGULATES_GuBP,,Q14863719
614762,Q11705457,Q14864067,UP_REGULATES_GuBP,,Q14863719
614763,Q423701,Q14878358,UP_REGULATES_GuBP,,Q14761970
614764,Q3616234,Q14878358,UP_REGULATES_GuBP,,Q14761970


In [2]:
# Create time stamp
timeStringNow = datetime.now().strftime("+%Y-%m-%dT00:00:00Z") 
start_time = time.time()

# Read files into memory
anatomy = pd.read_json('files-Jan112022/anatomy.ndjson', lines=True) # Anatomy
bp = pd.read_json('files-Jan112022/biologicalprocess.ndjson', lines=True) # Biological Process
cc = pd.read_json('files-Jan112022/cellularcomponent.ndjson', lines=True) # Cellular Component
compounds = pd.read_json('files-Jan112022/compounds.ndjson', lines=True) # Compounds (large)
disease = pd.read_json('files-Jan112022/disease.ndjson', lines=True) # Disease 
gpp = pd.read_json('files-Jan112022/genes-pathway-protein.ndjson', lines=True) # Genes, Pathways, Proteins (large)
mf = pd.read_json('files-Jan112022/molecularfunction.ndjson', lines=True) # Molecular Function
phenotype = pd.read_json('files-Jan112022/phenotype.ndjson', lines=True) # Phenotype

# Output and print when query is complete
end_time = time.time() 
print("The total time of this upload is:", (end_time - start_time)/60, "minutes") # 27 min

The total time of this query is: 30.015519579251606 minutes


In [39]:
# Create time stamp
timeStringNow = datetime.now().strftime("+%Y-%m-%dT00:00:00Z") 
start_time = time.time()

# Read files into memory
anatomy = pd.read_json('files-Jan112022/anatomy.ndjson', lines=True) # Anatomy
bp = pd.read_json('files-Jan112022/biologicalprocess.ndjson', lines=True) # Biological Process
cc = pd.read_json('files-Jan112022/cellularcomponent.ndjson', lines=True) # Cellular Component
disease = pd.read_json('files-Jan112022/disease.ndjson', lines=True) # Disease 
mf = pd.read_json('files-Jan112022/molecularfunction.ndjson', lines=True) # Molecular Function
phenotype = pd.read_json('files-Jan112022/phenotype.ndjson', lines=True) # Phenotype

# Output and print when query is complete
end_time = time.time() 
print("The total time of this upload is:", (end_time - start_time)/60, "minutes") # 15 min

The total time of this upload is: 14.67608863512675 minutes


In [3]:
gpp.head() # protein and chemical compounds

Unnamed: 0,type,id,labels,descriptions,aliases,claims,sitelinks,lastrevid
0,item,Q27205,"{'zh': {'language': 'zh', 'value': '纖維蛋白'}, 'e...","{'id': {'language': 'id', 'value': 'protein'},...","{'zh': [{'language': 'zh', 'value': '血纖蛋白'}, {...","{'P508': [{'mainsnak': {'snaktype': 'value', '...","{'zhwiki': {'site': 'zhwiki', 'title': '纖維蛋白',...",1441196849
1,item,Q43656,"{'zh': {'language': 'zh', 'value': '膽固醇'}, 'ky...","{'it': {'language': 'it', 'value': 'molecola l...","{'jv': [{'language': 'jv', 'value': 'Koléstêro...","{'P373': [{'mainsnak': {'snaktype': 'value', '...","{'sqwiki': {'site': 'sqwiki', 'title': 'Kolest...",1446745797
2,item,Q49546,"{'ar': {'language': 'ar', 'value': 'أسيتون'}, ...","{'ru': {'language': 'ru', 'value': 'простейший...","{'it': [{'language': 'it', 'value': 'Dimetilch...","{'P1579': [{'mainsnak': {'snaktype': 'value', ...","{'commonswiki': {'site': 'commonswiki', 'title...",1442159222
3,item,Q63398,"{'fr': {'language': 'fr', 'value': 'Secrétoneu...","{'en': {'language': 'en', 'value': 'mammalian ...","{'en': [{'language': 'en', 'value': 'CHGB'}, {...","{'P352': [{'mainsnak': {'snaktype': 'value', '...",{},1306060411
4,item,Q105522,"{'zh-hans': {'language': 'zh-hans', 'value': '...","{'it': {'language': 'it', 'value': 'il prodott...","{'he': [{'language': 'he', 'value': 'חומצה אור...","{'P31': [{'mainsnak': {'snaktype': 'value', 'p...","{'enwiki': {'site': 'enwiki', 'title': 'Uric a...",1444543825


In [4]:
gpp.tail() # pathway, bp, ??? gene ???

Unnamed: 0,type,id,labels,descriptions,aliases,claims,sitelinks,lastrevid
45200,item,Q36804479,"{'en': {'language': 'en', 'value': 'Transcript...","{'en': {'language': 'en', 'value': 'An instanc...",{},"{'P2860': [{'mainsnak': {'snaktype': 'value', ...",{},1403180928
45201,item,Q36804509,"{'en': {'language': 'en', 'value': 'PCBP4 modu...","{'en': {'language': 'en', 'value': 'An instanc...",{},"{'P2860': [{'mainsnak': {'snaktype': 'value', ...",{},1403061521
45202,item,Q36811970,"{'en': {'language': 'en', 'value': 'Chk1/Chk2(...","{'en': {'language': 'en', 'value': 'An instanc...",{},"{'P361': [{'mainsnak': {'snaktype': 'value', '...",{},1403059511
45203,item,Q36813090,"{'en': {'language': 'en', 'value': 'Loading of...","{'en': {'language': 'en', 'value': 'An instanc...",{},"{'P703': [{'mainsnak': {'snaktype': 'value', '...",{},1403306911
45204,item,Q36813105,"{'en': {'language': 'en', 'value': 'GTSE1 bind...","{'en': {'language': 'en', 'value': 'An instanc...",{},"{'P703': [{'mainsnak': {'snaktype': 'value', '...",{},1403112327


In [15]:
# http://files.hpc.weso.es/
bp_andra = pd.read_json('file-examples/result_biological_process.json', orient=str) # Biological Process
mf_andra = pd.read_json('file-examples/result_molecular_funcion.json', orient=str) # Molecular Function

The total time of this upload is: 0.02764338254928589 minutes


In [16]:
bp_andra.head() # empty (contacted Andra) -- check back with Andra on Monday

### Comparison of data to Andra and Mike (bp and mf should be ~identical across, anatomy as test)

In [21]:
print(len(mf_andra), "- Andra (wdsub)",)
print(len(mf), "- Me (WDF, P31:Q14860489)") # I'm closer to this number for Mike, but still low

# adjust categories

10940 - Andra (wdsub)
4096 - Me (WDF, P31:Q14860489)


In [31]:
nodes_mike[':LABEL'].value_counts() # 19 categories

Biological Process           24718
Gene                         23737
Protein Family               10406
Compound                      8557
Disease                       8302
Protein Domain                5230
Molecular Function            4397
Biological Pathway            2461
Sequence Variant              1835
Cellular Component            1785
Chemical Hazard                687
Super-Secondary Structure      663
Chemical Role                  499
Symptom                        228
Anatomical Structure           218
Active Site                    114
Structural Motif               109
Medical Specialty               69
Binding Site                    69
Name: :LABEL, dtype: int64

In [34]:
print(len(bp), "- Me (WDF, P31:Q2996394)") 
print(len(anatomy), "- Me (WDF, P1402)")

# Note Mike had 2508 here... https://github.com/mmayers12/metapaths/blob/main/1_code/01a_WikiData_Nodes.ipynb

14171 - Me (WDF, P31:Q2996394)
540 - Me (WDF, P1402)


### Convert each to category csv and compile

In [38]:
# Subset from .ndjson, need to manually label (convert to csv I can then append)
### Make sure identitical to Mike's doc for terms (see above output)

# compile them and come back to it later

## Anatomy 
csvFile = open('nodes.csv', 'w')

for index, row in anatomy.iterrows():
    print(row["id"]+",Anatomy,"+row["labels"]["en"]["value"], file = csvFile)
    
    
for index, row in bp.iterrows():
    print(row["id"]+",Biological Process,"+row["labels"]["en"]["value"], file = csvFile)
    
csvFile.close()

csvdf = pd.read_csv('nodes.csv')
csvdf.head()

## Biological Process



## Cellular Component

## Compound

## Disease

## Genes-Pathway-Protein

## Phenotype **not in rephetio**

## Protein **not in rephetio**

## Molecular Function


ParserError: Error tokenizing data. C error: Expected 3 fields in line 545, saw 5


In [None]:
#edges

### Input
19 categories (Rephetio): Biological Process, Gene, Protein Family, Compound, Disease, Protein Domain, Molecular Function, Biological Pathway, Sequence Variant, Cellular Component, Chemical Hazard, Super-Secondary Structure, Chemical Role, Symptom, Anatomical Structure, Active Site, Structural Motif, Medical Specialty, Binding Site <br>

2. [16 categories](https://github.com/SuLab/WD-rephetio-analysis/blob/v1.0/1_code/01_querying_wikidata_for_hetnet_edges.ipynb) for nodes.csv: Active Site, Anatomical Structure, Biological Pathway, Biological Process, Binding Site, Chemical Hazard, Chemical Role, Cellular Component, Compound,  Disease, Gene, Medical Specialty, Molecular Function, Protein Domain, Protein Family, Sequence Variant, Super-Secondary Structure Structural Motif, Symptom <br> **need to check if this is 16 or 18**
3. [25 categories](http://files.hpc.weso.es/): Active Site, Anatomical Structure, Binding Site, Biological Pathway, Biological Process, Chemical Compound, Chromosome, Cities, Disease, Mechanism of Action, Medication, Molecular Function (spelling), Pharmaceutical Product, Pharmalogical Action, Protein, Protein Domain, Protein Family, Ribosomal RNA, Sequence Variant, Subclass Anatomical Structure, Subclass Binding Site, Supersecondary Structure, Symptom, Taxon, Therapeutic Use 



### Checks
- Are there duplicates in Mike's code for QID, and can I resolve this directly with the json?
- All node and edge categories transferred

### Steps
1. Data dump (.json) of all relevant biomedical entitites using WDF (Comparison of wdsub of WDumper to be determined)
2. 

### To Do
- Check that all item pages correspond to csv rows
- Include edges
- Translate over to WD Rephetio Analysis (Friday)
- Visualize

In [13]:
for index, row in df.iterrows():
    print(row["claims"]["P31"])
    sys.exit()

[{'mainsnak': {'snaktype': 'value', 'property': 'P31', 'datavalue': {'value': {'entity-type': 'item', 'numeric-id': 103914748, 'id': 'Q103914748'}, 'type': 'wikibase-entityid'}, 'datatype': 'wikibase-item'}, 'type': 'statement', 'id': 'Q7365$9390cb25-44c7-917c-7086-35e9c02f962b', 'rank': 'normal'}]


NameError: name 'sys' is not defined

In [4]:
dataTypeSeries = df.dtypes

print('Data type of each column of Dataframe :')
print(dataTypeSeries)

Data type of each column of Dataframe :
type            object
id              object
labels          object
descriptions    object
aliases         object
claims          object
sitelinks       object
lastrevid        int64
dtype: object


In [20]:

import os

outfilename = 'biohackathon-prefilter.csv.gz'
with gzip.open(outfilename, 'wt') as f:
    f.write(":ID,:LABEL,name\n".encode("UTF-8"))
    for item in j:
        csvline = item["id"]+",Compound,"+item["labels"]["en"]["value"]+"\n"
        f.write(csvline.encode("UTF-8"))

TypeError: write() argument must be str, not bytes