# Generating statistics for a subset of Wikidata

This notebook illustrates how to generate statistics for a subset of Wikidata. \
We use as an example https://www.wikidata.org/wiki/Q11173 (chemical compound).

Example Dataset wikidata subset: https://drive.google.com/drive/u/1/folders/1KjNwV5M2G3JzCrPgqk_TSx8wTE49O2Sx \
Example Dataset statistics: https://drive.google.com/drive/u/1/folders/1_4Mxd0MAo0l9aR3aInv0YMTJrtneh7HW 

### Example Invocation command

    papermill /Users/shashanksaurabh/Desktop/MS/ISI/isi/kgtk_shashank73744/kgtk/examples/Example_9-Wikidata_Subset_Statistics_.ipynb \
    /Users/shashanksaurabh/Desktop/MS/ISI/isi/kgtk_shashank73744/kgtk/examples/Example_9_output.ipynb \
    -p wikidata_home '/Users/shashanksaurabh/Desktop/Data_isi' \
    -p wikidata_parts_folder '/Users/shashanksaurabh/Desktop/Data_isi/Chemical' \
    -p cache_folder '/Users/shashanksaurabh/Desktop/Data_isi/Temp' \
    -p output_folder '/Users/shashanksaurabh/Desktop/Data_isi/output' \
    -p delete_database 'yes' \
    -p K \"10\" \
    -p subset_name 'Q11173'


# Naming Convention:

## [subset_name].[section].[brief_description].tsv

## subset_name 
It is the qnode corresponding to the Wikidata subset. For example, if the subset covers chemical compounds, then “subset_name” is “Q11173”.

## section 
1. It is the section number in the “Knowledge Graph Statistics” outline. For example, for the class summary it is ‘1’.

2. If a main section has more than one subsection, the subsection number is appended after a dot (‘.’). For example, ‘2.2’ corresponds to ‘item properties’ under ‘Properties’.

3. If a subsection has subsections of its own, their numbers are appended recursively after a dot (‘.’). For example, the string statistics for section 2.6 are named 2.6.1.

## brief_description:
1. It is a brief description that follows the section number.
2. For the class summary it is ‘class_summary’.
3. For ‘examples’ it is the qnode for which 3 examples are shown.
4. It is ‘property_summary_[type]’ for each of the data types in section 2.1.
5. It is ‘item_properties’ for section 2.3.
6. It is ‘time_properties’ for section 2.4.
7. It is ‘quantity_properties’ for section 2.5.
8. It is ‘distinct_values_[type]’ for each of the types in section 2.6.
9. It is ‘globe_coordinate_top_m’ for section 2.7.
10. It is ‘geo_shape_top_m’ for section 2.8.
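The convention above can be sketched as a small helper. This is only an illustration: `stats_filename` is a hypothetical name that does not appear in the notebook, which instead spells each file name out by hand.

```python
def stats_filename(subset_name, section, brief_description):
    """Compose '[subset_name].[section].[brief_description].tsv'."""
    # section may be a string such as "2.6.2" or a sequence of ints such as (2, 6, 2)
    if not isinstance(section, str):
        section = ".".join(str(part) for part in section)
    return f"{subset_name}.{section}.{brief_description}.tsv"


print(stats_filename("Q11173", "1", "class_summary"))
print(stats_filename("Q11173", (2, 6, 2), "distinct_values_string"))
```

Generating names in one place like this would keep the dot separators consistent across all output files.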


In [3]:
wikidata_home = "/Users/shashanksaurabh/Desktop/Data_isi"
# path to folder which contains all files corresponding to the wikidata subset. 
#(For more information on wikidata subset please check Example 8)
wikidata_parts_folder = "/Users/shashanksaurabh/Desktop/Data_isi/Chemical"
# The notebook creates a cache, which is stored in cache_folder. The cache can be deleted after the execution.
cache_folder = "/Users/shashanksaurabh/Desktop/Data_isi/Temp"
# path to the folder where the output (here statistics) would be stored
output_folder = "/Users/shashanksaurabh/Desktop/Data_isi/output"
# delete_database = "yes"
# The statistics also use the consolidated kgtk file with all the properties and items. This is the path to the consolidated file
# Each statistic keeps only the top K results. In the following examples this is implemented with the --limit option.
K = "10"
subset_name = "Q11173"

In [4]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd

from IPython.display import display, HTML

# import altair as alt
# alt.renderers.enable('altair_viewer')

# from IPython.display import display, HTML, Image
# from pandas_profiling import ProfileReport


### Set up environment variables and folders that we need

In [5]:
# path to folder which contains all files corresponding to the wikidata subset. 
#(For more information on wikidata subset please check Example 8)
os.environ['WIKIDATA_PARTS'] = wikidata_parts_folder
# path to the folder where the output (here statistics) would be stored
os.environ['OUTPUT_FOLDER'] = output_folder
# kgtk command to run
os.environ['kgtk'] = "kgtk"
# override the plain command above: time each invocation and print debug output
os.environ['kgtk'] = "time kgtk --debug"
# absolute path of the db
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(cache_folder)
os.environ['K'] = K
os.environ['label'] = subset_name + ".label.en.tsv.gz"
# file name corresponding to different part of the subsets.
os.environ['subset_name']  = subset_name
os.environ['external_id']  = subset_name + ".part.external-id.tsv.gz"
os.environ['time']  = subset_name + ".part.time.tsv.gz"
os.environ['wikibase_item']  = subset_name + ".part.wikibase-item.tsv.gz"
os.environ['quantity']  = subset_name +  ".part.quantity.tsv.gz"
os.environ['statistics']  = subset_name + ".statistics.tsv.gz"
os.environ['wikibase_form']  = subset_name + ".part.wikibase-form.tsv.gz"
os.environ['monolingualtext']  = subset_name + ".part.monolingualtext.tsv.gz"
os.environ['math']  = subset_name + ".part.math.tsv.gz"
os.environ['commonsMedia']  = subset_name + ".part.commonsMedia.tsv.gz"
os.environ['globe_coordinate']  = subset_name + ".part.globe-coordinate.tsv.gz"
os.environ['musical_notation']  = subset_name + ".part.musical-notation.tsv.gz"
os.environ['geo_shape']  = subset_name + ".part.geo-shape.tsv.gz"
os.environ['url']  = subset_name + ".part.url.tsv.gz"
os.environ['string']  = subset_name + ".part.string.tsv.gz"
# Output file for statistics 1.1
os.environ['class_summary']  = subset_name + ".1.class_summary.tsv"
# Output files for statistics 2.1
os.environ['property_summary']  = subset_name + ".2.1.property_summary.tsv"
# Output files for statistics 2.2
os.environ['property_summary_external_id']  = subset_name + ".2.2.1.property_summary_external_id.tsv"
os.environ['property_summary_time']  = subset_name + ".2.2.2.property_summary_time.tsv"
os.environ['property_summary_wikibase_item']  = subset_name + ".2.2.3.property_summary_wikibase_item.tsv"
os.environ['property_summary_quantity']  = subset_name + ".2.2.4.property_summary_quantity.tsv"
os.environ['property_summary_wikibase_form']  = subset_name + ".2.2.5.property_summary_wikibase_form.tsv"
os.environ['property_summary_monolingualtext']  = subset_name + ".2.2.6.property_summary_monolingualtext.tsv"
os.environ['property_summary_math']  = subset_name + ".2.2.7.property_summary_math.tsv"
os.environ['property_summary_commonsMedia']  = subset_name + ".2.2.8.property_summary_commonsMedia.tsv"
os.environ['property_summary_globe_coordinate']  = subset_name + ".2.2.9.property_summary_globe_coordinate.tsv"
os.environ['property_summary_musical_notation']  = subset_name + ".2.2.10.property_summary_musical_notation.tsv"
os.environ['property_summary_geo_shape']  = subset_name + ".2.2.11.property_summary_geo_shape.tsv"
os.environ['property_summary_url']  = subset_name + ".2.2.12.property_summary_url.tsv"
os.environ['property_summary_string']  = subset_name + ".2.2.13.property_summary_string.tsv"
# Output files for statistics 2.3
os.environ['item_properties']  = subset_name + ".2.3.item_properties.tsv"
# Output files for statistics 2.4
os.environ['time_properties']  = subset_name + ".2.4.1.time_properties.tsv"
os.environ['time_properties_count']  = subset_name + ".2.4.2.time_properties.tsv"
# Output files for statistics 2.5
os.environ['quantity_properties']  = subset_name + ".2.5.quantity_properties.tsv"
# Output files for statistics 2.6
os.environ['distinct_values_math']  = subset_name + ".2.6.1.distinct_values_math.tsv"
os.environ['distinct_values_string']  = subset_name + ".2.6.2.distinct_values_string.tsv"
os.environ['distinct_values_external_id']  = subset_name + ".2.6.3.distinct_values_external_id.tsv"
os.environ['distinct_values_monolingual_text']  = subset_name + ".2.6.4.distinct_values_monolingual_text.tsv"
os.environ['distinct_values_musical_notation']  = subset_name + ".2.6.5.distinct_values_musical_notation.tsv"
# Output files for statistics 2.7
os.environ['globe_coordinate_top_m']  = subset_name + ".2.7.globe_coordinate_top_m.tsv"
# Output files for statistics 2.8
os.environ['geo_shape_top_m']  = subset_name + ".2.8.geo_shape_top_m.tsv"

In [6]:
def run_command(cmd, substitution_dictionary=None):
    """Run a templatized shell command after substituting placeholders."""
    # Avoid a mutable default argument; None means "no substitutions".
    for k, v in (substitution_dictionary or {}).items():
        cmd = cmd.replace(k, v)

    print(cmd)
    # With shell=True the command is passed as a single string.
    output = subprocess.run(cmd, shell=True, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    print(output.stdout)
    print(output.stderr)
    #print(output.returncode)
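
The placeholder mechanism in `run_command` is plain string replacement. The sketch below isolates that step without running anything in a shell. `fill_template` and the placeholder names `__input` and `__k` are hypothetical; only `__subset_name` is used this way in the notebook.

```python
def fill_template(cmd, substitutions=None):
    """Replace each placeholder key in cmd with its value using str.replace."""
    for placeholder, value in (substitutions or {}).items():
        cmd = cmd.replace(placeholder, value)
    return cmd


template = "kgtk query -i __input --where 'n2 != \"__subset_name\"' --limit __k"
filled = fill_template(
    template,
    {"__input": "Q11173.part.wikibase-item.tsv.gz", "__subset_name": "Q11173", "__k": "10"},
)
print(filled)
```

Because every occurrence of a key is replaced, placeholders are safest when they cannot appear elsewhere in the command, which is presumably why the notebook prefixes them with a double underscore.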

# 1. Classes
## 1.1 Class summary
### List of the top K classes based on the number of instances
### class -- the qnode corresponding to the class
### pnode -- the property by which the class is linked
### name -- the label of the class
### number of instances -- the number of instances of the class in the Wikidata subset
### pagerank -- the PageRank of the class
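
As a rough illustration of what the class summary counts, the sketch below reproduces the idea on a toy in-memory edge list using only the standard library. The edges and the `top_classes` helper are made up for illustration; the notebook itself computes this with a `kgtk query` over the subset files.

```python
from collections import Counter

# Toy (node1, label, node2) edges. The class Q-ids reuse values from the
# example subset, but the edges themselves are invented.
edges = [
    ("Q_a", "P31", "Q12140"),
    ("Q_b", "P31", "Q12140"),
    ("Q_c", "P279", "Q11367"),
    ("Q_d", "P31", "Q11173"),    # the subset's own class, excluded below
    ("Q_e", "P2175", "Q12140"),  # not an instance-of/subclass-of edge, ignored
]


def top_classes(edges, subset_name, k):
    """Count P31/P279 targets, skip the subset class, return the k most frequent."""
    counts = Counter(
        node2
        for _, label, node2 in edges
        if label in ("P31", "P279") and node2 != subset_name
    )
    return counts.most_common(k)


print(top_classes(edges, "Q11173", 10))
```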


In [47]:
cmd = "$kgtk query -i $WIKIDATA_PARTS/$wikibase_item -i $WIKIDATA_PARTS/$label -i $WIKIDATA_PARTS/$statistics --graph-cache $STORE \
-o $OUTPUT_FOLDER/$class_summary \
--match 'item: (n1)-[l{label:llab}]->(n2), label: (n2)-[:label]->(label),statistics:(n2)-[:vertex_pagerank]->(pagerank) ' \
--return 'distinct n2 as class, label as name, count(n2) as value_count, pagerank as pagerank' \
--where '(label.kgtk_lqstring_lang_suffix = \"en\") AND (llab IN [\"P31\" , \"P279\"]) AND (n2 != \"__subset_name\") ' \
--order-by 'count(n2) desc' \
--limit $K "

run_command(cmd, {"__subset_name": subset_name})

$kgtk query -i $WIKIDATA_PARTS/$wikibase_item -i $WIKIDATA_PARTS/$label -i $WIKIDATA_PARTS/$statistics --graph-cache $STORE --match 'item: (n1)-[l{label:llab}]->(n2), label: (n2)-[:label]->(label),statistics:(n2)-[:vertex_pagerank]->(pagerank) ' --return 'distinct n2 as class, label as name, count(n2) as value_count, pagerank as pagerank' --where '(label.kgtk_lqstring_lang_suffix = "en") AND (llab IN ["P31" , "P279"]) AND (n2 != "Q11173") ' --order-by 'count(n2) desc' --limit $K 
class	name	value_count	pagerank
Q12140	'medication'@en	5044	0.0001114425917084744
Q63436503	'diacylglycerophosphocholine'@en	970	2.450382596823156e-05
Q187661	'carcinogen'@en	958	2.7154965664207216e-05
Q63446172	'wax monoester'@en	820	2.07372439399413e-05
Q11367	'lipid'@en	704	2.1515981963226294e-05
Q222174	'flavonoid'@en	690	1.8683836337623413e-05
Q35456	'essential medicine'@en	630	1.0921155026395615e-05
Q193430	'heterocyclic compound'@en	572	2.0566341593395852e-05
Q422248	'monoclonal antibody'@en	562	1.32400

In [50]:
df_class_summary = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv('class_summary')),delimiter='\t')
df_class_summary

Unnamed: 0,class,name,value_count,pagerank
0,Q12140,'medication'@en,5044,0.000111
1,Q63436503,'diacylglycerophosphocholine'@en,970,2.5e-05
2,Q187661,'carcinogen'@en,958,2.7e-05
3,Q63446172,'wax monoester'@en,820,2.1e-05
4,Q11367,'lipid'@en,704,2.2e-05
5,Q222174,'flavonoid'@en,690,1.9e-05
6,Q35456,'essential medicine'@en,630,1.1e-05
7,Q193430,'heterocyclic compound'@en,572,2.1e-05
8,Q422248,'monoclonal antibody'@en,562,1.3e-05
9,Q2250497,'unsaturated fatty acids'@en,532,1.4e-05


 ### Example wikibase items for each class
 ### qnode -- the qnode of the example wikibase item
 ### name -- the label of the qnode
 ### pagerank -- the PageRank of the qnode


In [48]:
try:
    df = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv('class_summary')),delimiter='\t')
    cmd = "$kgtk query -i $WIKIDATA_PARTS/$wikibase_item  -i $WIKIDATA_PARTS/$label -i $WIKIDATA_PARTS/$statistics --graph-cache $STORE \
    -o $OUTPUT_FOLDER/__output_file \
    --match 'item: (n1)-[l]->(n2:__class), label: (n1)-[:label]->(label), statistics:(n1)-[:vertex_pagerank]->(pagerank)' \
    --return 'n1 as qnode, label as name, pagerank as pagerank' \
    --where 'label.kgtk_lqstring_lang_suffix = \"en\"' \
    --order-by pagerank \
    --limit 3"
    for index, ele in df.iterrows():
        if ele["class"] ==  subset_name:
            continue
        print("Examples for ", ele["class"]," class")
        output_file = subset_name + ".1." + ele["class"]+".examples"+".tsv"
        run_command(cmd, {"__output_file": output_file,"__class": ele["class"]})
except Exception as e:
    print(e)

Examples for  Q12140  class
$kgtk query -i $WIKIDATA_PARTS/$wikibase_item  -i $WIKIDATA_PARTS/$label -i $WIKIDATA_PARTS/$statistics --graph-cache $STORE     --match 'item: (n1)-[l]->(n2:Q12140), label: (n1)-[:label]->(label), statistics:(n1)-[:vertex_pagerank]->(pagerank)'     --return 'n1 as qnode, label as name, pagerank as pagerank'     --where 'label.kgtk_lqstring_lang_suffix = "en"'     --order-by pagerank     --limit 3
qnode	name	pagerank
Q21011228	'dulaglutide'@en	1.0010828324322714e-06
Q21011228	'dulaglutide'@en	1.0010828324322714e-06
Q27270940	'tezacaftor'@en	1.0040699541178656e-06

[2020-10-13 13:00:25 query]: SQL Translation:
---------------------------------------------
  SELECT graph_3_c3."node1" "qnode", graph_2_c2."node2" "name", graph_3_c3."node2" "pagerank"
     FROM graph_1 AS graph_1_c1, graph_2 AS graph_2_c2, graph_3 AS graph_3_c3
     WHERE graph_1_c1."node2"=?
     AND graph_2_c2."label"=?
     AND graph_3_c3."label"=?
     AND graph_1_c1."node1"=graph_2_c2."node1

qnode	name	pagerank
Q420056	'Peginterferon alfa-2a'@en	1.0200345525789516e-06
Q213511	'erythromycin'@en	1.0229494886661334e-05
Q213511	'erythromycin'@en	1.0229494886661334e-05

[2020-10-13 13:00:30 query]: SQL Translation:
---------------------------------------------
  SELECT graph_2_c2."node1" "qnode", graph_2_c2."node2" "name", graph_3_c3."node2" "pagerank"
     FROM graph_1 AS graph_1_c1, graph_2 AS graph_2_c2, graph_3 AS graph_3_c3
     WHERE graph_1_c1."node2"=?
     AND graph_2_c2."label"=?
     AND graph_3_c3."label"=?
     AND graph_1_c1."node1"=graph_2_c2."node1"
     AND graph_1_c1."node1"=graph_3_c3."node1"
     AND (kgtk_lqstring_lang_suffix(graph_2_c2."node2") = ?)
     ORDER BY graph_3_c3."node2" ASC
     LIMIT ?
  PARAS: ['Q35456', 'label', 'vertex_pagerank', 'en', 3]
---------------------------------------------
        0.75 real         0.52 user         0.13 sys

Examples for  Q193430  class
$kgtk query -i $WIKIDATA_PARTS/$wikibase_item  -i $WIKIDATA_PARTS/$label -i 

In [10]:
df_class_example = []
try:
    df = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv('class_summary')),delimiter='\t')
    for index, ele in df.iterrows():
        if ele["class"] ==  subset_name:
            continue
        output_file = subset_name + ".1." + ele["class"]+".examples"+".tsv"
        df_class_example.append(pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),output_file),delimiter='\t'))
    for i in range(len(df_class_example)):
        display(df_class_example[i])
        print("-------------------------")
except Exception as e:
    print(e)

Unnamed: 0,qnode,name,pagerank
0,Q21011228,'dulaglutide'@en,1e-06
1,Q21011228,'dulaglutide'@en,1e-06
2,Q27270940,'tezacaftor'@en,1e-06


-------------------------


Unnamed: 0,qnode,name,pagerank
0,Q27105002,"'1-hexadecanoyl-2-(9Z,12Z-octadecadienoyl)-sn-...",1e-06
1,Q27145082,'1-stearoyl-2-palmitoyl-sn-glycero-3-phosphoch...,1e-06
2,Q27145169,'1-palmitoyl-2-acetyl-sn-glycero-3-phosphochol...,1e-06


-------------------------


Unnamed: 0,qnode,name,pagerank
0,Q27289240,'bromodichloroacetic acid'@en,1e-06
1,Q2307855,'glycidaldehyde'@en,1e-06
2,Q27257137,'4-chloromethylbiphenyl'@en,1e-06


-------------------------


Unnamed: 0,qnode,name,pagerank
0,Q27269581,'ethyl-2-nonynoate'@en,1e-06
1,Q27268650,'cetyl myristate'@en,1e-06
2,Q27285351,'2-pentyl butyrate'@en,1e-06


-------------------------


Unnamed: 0,qnode,name,pagerank
0,Q37614694,'C16 DHLactosylceramide (incomplete stereochem...,1.586265e-06
1,Q37614694,'C16 DHLactosylceramide (incomplete stereochem...,1.586265e-06
2,Q37613317,'dihydroceramide-1-phosphorylcholine'@en,1.869609e-07


-------------------------


Unnamed: 0,qnode,name,pagerank
0,Q27105590,"'7,3\\\\\\\\'-dihydroxy-4\\\\\\\\'-methoxy-8-m...",1e-06
1,Q27103316,"'3\\\\\\\\',4\\\\\\\\',5,6-tetrahydroxy-3,7-di...",1e-06
2,Q7352935,'robinetinidol'@en,1e-06


-------------------------


Unnamed: 0,qnode,name,pagerank
0,Q420056,'Peginterferon alfa-2a'@en,1e-06
1,Q213511,'erythromycin'@en,1e-05
2,Q213511,'erythromycin'@en,1e-05


-------------------------


Unnamed: 0,qnode,name,pagerank
0,Q10322996,'mafosfamide'@en,1e-06
1,Q4492673,'metofenazate'@en,1e-06
2,Q4637151,'4-hydroperoxycyclophosphamide'@en,1e-06


-------------------------


Unnamed: 0,qnode,name,pagerank
0,Q3655009,'Canakinumab'@en,1e-06
1,Q7444755,'Secukinumab'@en,1e-06
2,Q410656,'Ofatumumab'@en,1e-06


-------------------------


Unnamed: 0,qnode,name,pagerank
0,Q27257146,'(3E)-3-octenoic acid'@en,1e-06
1,Q2707986,'paullinic acid'@en,1e-06
2,Q27277849,'4-ethyloctanoic acid'@en,1e-06


-------------------------


# 2. Property

# 2.1 First a summary by property type
## For each type
### pnode, label, count of items with property (see example below)
### Example (show an instance or two of this property)
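
To make the per-property counts concrete, here is a minimal standard-library sketch on a toy edge list. It reports, for each property, the number of statements and the number of distinct subject nodes. The edges and the `property_summary` helper are hypothetical and only illustrate one way such a summary could be computed.

```python
from collections import defaultdict

# Toy (node1, property, node2) edges; values are invented for illustration.
edges = [
    ("Q1", "P575", "^1900-01-01T00:00:00Z/9"),
    ("Q1", "P575", "^1901-01-01T00:00:00Z/9"),
    ("Q2", "P575", "^1950-01-01T00:00:00Z/9"),
    ("Q3", "P571", "^2000-01-01T00:00:00Z/9"),
]


def property_summary(edges):
    """Return (property, statement_count, distinct_subject_count) rows, largest first."""
    statements = defaultdict(int)
    subjects = defaultdict(set)
    for node1, prop, _ in edges:
        statements[prop] += 1
        subjects[prop].add(node1)
    rows = [(prop, statements[prop], len(subjects[prop])) for prop in statements]
    # Order by statement count, then by distinct subjects, both descending.
    return sorted(rows, key=lambda row: (-row[1], -row[2]))


for row in property_summary(edges):
    print(row)
```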



In [49]:
types = [
    ("time","time","property_summary_time"),
    ("wikibase_item","wikibase_item","property_summary_wikibase_item"),
    ("math","math","property_summary_math"),
    ("wikibase_form","wikibase-form","property_summary_wikibase_form"),
    ("quantity","quantity","property_summary_quantity"),
    ("string","string","property_summary_string"),
    ("external_id","external-id","property_summary_external_id"),
    ("commonsMedia","commonsMedia","property_summary_commonsMedia"),
    ("globe_coordinate","globe-coordinate","property_summary_globe_coordinate"),
    ("monolingualtext","monolingualtext","property_summary_monolingualtext"),
    ("musical_notation","musical-notation","property_summary_musical_notation"),
    ("geo_shape","geo-shape","property_summary_geo_shape"),
    ("url","url","property_summary_url"),
]

cmd = "$kgtk query  -i $WIKIDATA_PARTS/$TYPE_FILE -i $WIKIDATA_PARTS/$label --graph-cache $STORE \
-o $OUTPUT_FOLDER/$output_file \
--match 'part: (n1)-[l{label: llab}]->(n2), label: (llab)-[:label]->(label)' \
--return 'distinct llab as property, count(llab) as number_of_statements, count(n2) as number_of_nodes_having_the_property, label as `label`' \
--where 'label.kgtk_lqstring_lang_suffix = \"en\"' \
--order-by 'count(llab) desc, count(n2) desc' \
--limit $K"

# type_name avoids shadowing the built-in `type`
for type_name, name, output_file in types:
    run_command(cmd, {"TYPE_FILE": type_name, "output_file": output_file})

$kgtk query  -i $WIKIDATA_PARTS/$time -i $WIKIDATA_PARTS/$label --graph-cache $STORE --match 'part: (n1)-[l{label: llab}]->(n2), label: (llab)-[:label]->(label)' --return 'distinct llab as property, count(llab) as number_of_statements, count(n2) as number_of_nodes_having_the_property, label as `label`' --where 'label.kgtk_lqstring_lang_suffix = "en"' --order-by 'count(llab) desc, count(n2) desc' --limit $K
property	number_of_statements	number_of_nodes_having_the_property	label
P575	20	20	'time of discovery or invention'@en
P2669	1	1	'discontinued date'@en
P571	1	1	'inception'@en
P729	1	1	'service entry'@en
P730	1	1	'service retirement'@en

[2020-10-13 13:00:55 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_2_c2."node1" "property", count(graph_2_c2."node1") "number_of_statements", count(graph_5_c1."node2") "number_of_nodes_having_the_property", graph_2_c2."node2" "label"
     FROM graph_2 AS graph_2_c2, graph_5 AS graph_5_c1
     WHERE gra

property	number_of_statements	number_of_nodes_having_the_property	label
P235	997389	997389	'InChIKey'@en
P234	990406	990406	'InChI'@en
P231	927660	927660	'CAS Registry Number'@en
P3117	848585	848585	'DSSTox substance ID'@en
P662	250629	250629	'PubChem CID'@en
P661	125009	125009	'ChemSpider ID'@en
P2840	115816	115816	'NSC number'@en
P683	86055	86055	'ChEBI ID'@en
P652	59120	59120	'UNII'@en
P232	54335	54335	'EC number'@en

[2020-10-13 13:01:05 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_10_c1."label" "property", count(graph_10_c1."label") "number_of_statements", count(graph_10_c1."node2") "number_of_nodes_having_the_property", graph_2_c2."node2" "label"
     FROM graph_10 AS graph_10_c1, graph_2 AS graph_2_c2
     WHERE graph_10_c1."label"=graph_10_c1."label"
     AND graph_2_c2."label"=?
     AND graph_10_c1."label"=graph_2_c2."node1"
     AND (kgtk_lqstring_lang_suffix(graph_2_c2."node2") = ?)
     GROUP BY property
     ORDER BY count

property	number_of_statements	number_of_nodes_having_the_property	label
P2888	41	41	'exact match'@en
P856	26	26	'official website'@en
P6363	4	4	'WordLift URL'@en
P1482	1	1	'Stack Exchange tag'@en
P1709	1	1	'equivalent class'@en
P2078	1	1	'user manual URL'@en
P2699	1	1	'URL'@en
P854	1	1	'reference URL'@en

[2020-10-13 13:01:35 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_16_c1."label" "property", count(graph_16_c1."label") "number_of_statements", count(graph_16_c1."node2") "number_of_nodes_having_the_property", graph_2_c2."node2" "label"
     FROM graph_16 AS graph_16_c1, graph_2 AS graph_2_c2
     WHERE graph_16_c1."label"=graph_16_c1."label"
     AND graph_2_c2."label"=?
     AND graph_16_c1."label"=graph_2_c2."node1"
     AND (kgtk_lqstring_lang_suffix(graph_2_c2."node2") = ?)
     GROUP BY property
     ORDER BY count(graph_16_c1."label") DESC, count(graph_16_c1."node2") DESC
     LIMIT ?
  PARAS: ['label', 'en', 10]
----------------

In [70]:
df_property_summary = []
types = [
    ("time","time","property_summary_time"),
    ("wikibase_item","wikibase_item","property_summary_wikibase_item"),
    ("math","math","property_summary_math"),
    ("wikibase_form","wikibase-form","property_summary_wikibase_form"),
    ("quantity","quantity","property_summary_quantity"),
    ("string","string","property_summary_string"),
    ("external_id","external-id","property_summary_external_id"),
    ("commonsMedia","commonsMedia","property_summary_commonsMedia"),
    ("globe_coordinate","globe-coordinate","property_summary_globe_coordinate"),
    ("monolingualtext","monolingualtext","property_summary_monolingualtext"),
    ("musical_notation","musical-notation","property_summary_musical_notation"),
    ("geo_shape","geo-shape","property_summary_geo_shape"),
    ("url","url","property_summary_url"),
]
for type_name, name, output_file in types:
    try:
        temp = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv(output_file)),delimiter='\t')
    except Exception:
        # skip data types whose summary file could not be read
        continue
    df_property_summary.append(temp)
    if len(temp) > 0:
        display(temp)

Unnamed: 0,property,number_of_statements,number_of_nodes_having_the_property,label
0,P575,20,20,'time of discovery or invention'@en
1,P2669,1,1,'discontinued date'@en
2,P571,1,1,'inception'@en
3,P729,1,1,'service entry'@en
4,P730,1,1,'service retirement'@en


Unnamed: 0,property,number_of_statements,number_of_nodes_having_the_property,label
0,P972,49434,49434,'catalog'@en
1,P527,15973,15973,'has part'@en
2,P361,9430,9430,'part of'@en
3,P2175,6121,6121,'medical condition treated'@en
4,P3780,4178,4178,'active ingredient in'@en
5,P129,3852,3852,'physically interacts with'@en
6,P769,1721,1721,'significant drug interaction'@en
7,P3364,1457,1457,'stereoisomer of'@en
8,P4952,1376,1376,'safety classification and labelling'@en
9,P3489,1353,1353,'pregnancy category'@en


Unnamed: 0,property,number_of_statements,number_of_nodes_having_the_property,label
0,P274,344854,344854,'chemical formula'@en
1,P233,199853,199853,'canonical SMILES'@en
2,P2017,141589,141589,'isomeric SMILES'@en
3,P373,3463,3463,'Commons category'@en
4,P1931,615,615,'NIOSH Pocket Guide ID'@en
5,P1748,416,416,'NCI Thesaurus ID'@en
6,P1987,378,378,'MCN code'@en
7,P1820,228,228,'Open Food Facts food additive ID'@en
8,P935,75,75,'Commons gallery'@en
9,P920,15,15,'LEM ID'@en


Unnamed: 0,property,number_of_statements,number_of_nodes_having_the_property,label
0,P235,997389,997389,'InChIKey'@en
1,P234,990406,990406,'InChI'@en
2,P231,927660,927660,'CAS Registry Number'@en
3,P3117,848585,848585,'DSSTox substance ID'@en
4,P662,250629,250629,'PubChem CID'@en
5,P661,125009,125009,'ChemSpider ID'@en
6,P2840,115816,115816,'NSC number'@en
7,P683,86055,86055,'ChEBI ID'@en
8,P652,59120,59120,'UNII'@en
9,P232,54335,54335,'EC number'@en


Unnamed: 0,property,number_of_statements,number_of_nodes_having_the_property,label
0,P117,12520,12520,'chemical structure'@en
1,P18,1992,1992,'image'@en
2,P8224,20,20,'molecular model'@en
3,P443,17,17,'pronunciation audio'@en
4,P989,6,6,'spoken text audio'@en
5,P692,3,3,'Gene Atlas Image'@en
6,P4896,2,2,'3D model'@en
7,P10,1,1,'video'@en
8,P242,1,1,'locator map image'@en
9,P5555,1,1,'schematic'@en


Unnamed: 0,property,number_of_statements,number_of_nodes_having_the_property,label
0,P625,1,1,'coordinate location'@en


Unnamed: 0,property,number_of_statements,number_of_nodes_having_the_property,label
0,P2275,2414,2414,'World Health Organisation International Nonpr...
1,P1705,13,13,'native label'@en
2,P1813,9,9,'short name'@en
3,P2561,4,4,'name'@en
4,P1448,1,1,'official name'@en
5,P1449,1,1,'nickname'@en
6,P1476,1,1,'title'@en
7,P7243,1,1,'pronunciation'@en


Unnamed: 0,property,number_of_statements,number_of_nodes_having_the_property,label
0,P2888,41,41,'exact match'@en
1,P856,26,26,'official website'@en
2,P6363,4,4,'WordLift URL'@en
3,P1482,1,1,'Stack Exchange tag'@en
4,P1709,1,1,'equivalent class'@en
5,P2078,1,1,'user manual URL'@en
6,P2699,1,1,'URL'@en
7,P854,1,1,'reference URL'@en


# 2.2 Divided by data type of property


## For each data type, take the top K properties by number of statements and find the units used and the value count of each unit

## Header
### units -- the qnode corresponding to the unit
### value_count -- the number of values that use the unit
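
The unit tally can be illustrated on a handful of quantity literals. The sketch below assumes a simplified literal form (a signed number, an optional bracketed tolerance, and an optional Q-node unit); the regular expression is a stand-in written for this example and is not KGTK's own parser, which the notebook reaches through `kgtk_quantity_wd_units`.

```python
import re
from collections import Counter

# Simplified pattern (an assumption): number, optional [low,high], optional Q-node unit.
QUANTITY = re.compile(r"^[+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?(?:\[[^\]]*\])?(Q\d+)?$")


def unit_counts(values):
    """Count values per unit; values without a unit are tallied under ''."""
    counts = Counter()
    for value in values:
        match = QUANTITY.match(value)
        if match:  # strings that are not quantities are skipped
            counts[match.group(1) or ""] += 1
    return counts.most_common()


# Invented sample values for illustration.
values = ["+18.01Q483261", "+46.07Q483261", "+100[+99,+101]Q25267", "+7", "not-a-quantity"]
print(unit_counts(values))
```

Values without a unit end up under the empty string, which corresponds to the blank `units` entry that can appear in the result tables.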

In [12]:
df_quantity_data_type = []
try:
    df = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv('property_summary_quantity')),delimiter='\t')
    cmd = "$kgtk query -i $WIKIDATA_PARTS/$quantity --graph-cache $STORE \
            -o $OUTPUT_FOLDER/__output \
            --match  '(n1)-[r:__property]->(v)' \
            --return 'distinct kgtk_quantity_wd_units(v) as units, count(v) as value_count' \
            --where 'kgtk_quantity(v)' \
            --order-by 'count(v) desc' \
            --limit 10 "
    for index, ele in df.iterrows():
        output_file = subset_name + ".2.2.6." + ele["property"]+".data_types"+".tsv"
        run_command(cmd, {"__property": ele["property"],"__output": output_file})
        temp = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),output_file),delimiter='\t')
        df_quantity_data_type.append(temp)
except Exception as e:
    print(e)

$kgtk query -i $WIKIDATA_PARTS/$quantity --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.6.P2067.data_types.tsv             --match  '(n1)-[r:P2067]->(v)'             --return 'distinct kgtk_quantity_wd_units(v) as units, count(v) as value_count'             --where 'kgtk_quantity(v)'             --order-by 'count(v) desc'             --limit 10 

[2020-10-19 17:38:51 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_quantity_wd_units(graph_1_c1."node2") "units", count(graph_1_c1."node2") "value_count"
     FROM graph_1 AS graph_1_c1
     WHERE graph_1_c1."label"=?
     AND kgtk_quantity(graph_1_c1."node2")
     GROUP BY units
     ORDER BY count(graph_1_c1."node2") DESC
     LIMIT ?
  PARAS: ['P2067', 10]
---------------------------------------------
        4.60 real         2.96 user         0.43 sys

$kgtk query -i $WIKIDATA_PARTS/$quantity --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.6.P2101.data_types.tsv 


[2020-10-19 17:39:01 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_quantity_wd_units(graph_1_c1."node2") "units", count(graph_1_c1."node2") "value_count"
     FROM graph_1 AS graph_1_c1
     WHERE graph_1_c1."label"=?
     AND kgtk_quantity(graph_1_c1."node2")
     GROUP BY units
     ORDER BY count(graph_1_c1."node2") DESC
     LIMIT ?
  PARAS: ['P2128', 10]
---------------------------------------------
        0.92 real         0.58 user         0.16 sys



In [14]:
for ele in df_quantity_data_type:
    display(ele)

Unnamed: 0,units,value_count
0,Q483261,146073
1,Q28924752,457
2,Q14623804,1


Unnamed: 0,units,value_count
0,Q25267,8975
1,Q42289,506
2,Q11579,10
3,,3


Unnamed: 0,units,value_count
0,Q13147228,1095
1,Q844211,8
2,Q834105,7
3,Q21604951,3


Unnamed: 0,units,value_count
0,Q3241121,460
1,Q41803,426
2,Q1645498,18
3,Q2332346,4


Unnamed: 0,units,value_count
0,Q42289,414
1,Q25267,334
2,Q11579,124


Unnamed: 0,units,value_count
0,Q21077820,648
1,Q21006887,67


Unnamed: 0,units,value_count
0,Q6859652,451
1,Q177974,57
2,Q21064807,43
3,Q5139563,31
4,Q44395,17
5,Q103510,4
6,Q185648,2


Unnamed: 0,units,value_count
0,Q21091747,336
1,Q21006887,72
2,Q21077820,45
3,Q21061369,12
4,Q21604951,11
5,Q2332346,1


Unnamed: 0,units,value_count
0,Q21127659,313
1,Q834105,51
2,Q21061369,26
3,Q60606516,20
4,Q55726194,9
5,Q55435387,1
6,Q21091747,1
7,Q21064845,1
8,Q13147228,1


Unnamed: 0,units,value_count
0,Q42289,316
1,Q25267,87
2,Q11579,2


In [24]:
df_quantity_data_type_magnitude = []
try:
    df = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv('property_summary_quantity')),delimiter='\t')
    cmd = "$kgtk query -i $WIKIDATA_PARTS/$quantity --graph-cache $STORE \
            -o $OUTPUT_FOLDER/__output \
            --match  '(n1)-[r:__property]->(v)' \
            --return 'distinct kgtk_quantity_number_int(v) as magnitude, count(v) as value_count' \
            --where 'kgtk_quantity(v)' \
            --order-by 'count(v) desc' "
    for index, ele in df.iterrows():
        output_file = subset_name + ".2.2.6." + ele["property"]+".data_types_magnitude"+".tsv"
        run_command(cmd, {"__property": ele["property"],"__output": output_file})
        temp = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),output_file),delimiter='\t')
        temp['magnitude'] = pd.cut(temp['magnitude'], bins=10)
        temp = temp['magnitude'].value_counts().rename_axis('magnitude').reset_index(name='counts')
        df_quantity_data_type_magnitude.append(temp)
        temp.to_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),output_file),sep='\t',index=False)
except Exception as e:
    print(e)

$kgtk query -i $WIKIDATA_PARTS/$quantity --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.6.P2067.data_types_magnitude.tsv             --match  '(n1)-[r:P2067]->(v)'             --return 'distinct kgtk_quantity_number_int(v) as magnitude, count(v) as value_count'             --where 'kgtk_quantity(v)'             --order-by 'count(v) desc' 

[2020-10-19 18:16:48 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_quantity_number_int(graph_1_c1."node2") "magnitude", count(graph_1_c1."node2") "value_count"
     FROM graph_1 AS graph_1_c1
     WHERE graph_1_c1."label"=?
     AND kgtk_quantity(graph_1_c1."node2")
     GROUP BY magnitude
     ORDER BY count(graph_1_c1."node2") DESC
  PARAS: ['P2067']
---------------------------------------------
        4.71 real         3.38 user         0.31 sys

$kgtk query -i $WIKIDATA_PARTS/$quantity --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.6.P2101.data_types_magnitude.tsv     

In [25]:
for ele in df_quantity_data_type_magnitude:
    display(ele)

Unnamed: 0,magnitude,counts
0,"(-743.511, 74553.1]",2130
1,"(298206.4, 372757.5]",5
2,"(223655.3, 298206.4]",3
3,"(149104.2, 223655.3]",3
4,"(74553.1, 149104.2]",2
5,"(670961.9, 745513.0]",1
6,"(521859.7, 596410.8]",1
7,"(447308.6, 521859.7]",1
8,"(596410.8, 670961.9]",0
9,"(372757.5, 447308.6]",0


Unnamed: 0,magnitude,counts
0,"(-385.029, 1031.9]",660
1,"(1031.9, 2434.8]",79
2,"(2434.8, 3837.7]",16
3,"(3837.7, 5240.6]",3
4,"(12255.1, 13658.0]",1
5,"(10852.2, 12255.1]",0
6,"(9449.3, 10852.2]",0
7,"(8046.4, 9449.3]",0
8,"(6643.5, 8046.4]",0
9,"(5240.6, 6643.5]",0


Unnamed: 0,magnitude,counts
0,"(-1.9, 190.0]",15
1,"(760.0, 950.0]",5
2,"(1520.0, 1710.0]",2
3,"(1710.0, 1900.0]",1
4,"(1140.0, 1330.0]",1
5,"(1330.0, 1520.0]",0
6,"(950.0, 1140.0]",0
7,"(570.0, 760.0]",0
8,"(380.0, 570.0]",0
9,"(190.0, 380.0]",0


Unnamed: 0,magnitude,counts
0,"(-0.09, 9.0]",10
1,"(9.0, 18.0]",7
2,"(18.0, 27.0]",6
3,"(36.0, 45.0]",4
4,"(81.0, 90.0]",3
5,"(72.0, 81.0]",2
6,"(63.0, 72.0]",2
7,"(45.0, 54.0]",2
8,"(27.0, 36.0]",2
9,"(54.0, 63.0]",1


Unnamed: 0,magnitude,counts
0,"(-320.208, 407.8]",335
1,"(407.8, 1128.6]",177
2,"(1128.6, 1849.4]",24
3,"(1849.4, 2570.2]",11
4,"(2570.2, 3291.0]",9
5,"(4732.6, 5453.4]",3
6,"(3291.0, 4011.8]",3
7,"(6174.2, 6895.0]",2
8,"(4011.8, 4732.6]",2
9,"(5453.4, 6174.2]",1


Unnamed: 0,magnitude,counts
0,"(-9.0, 900.0]",106
1,"(900.0, 1800.0]",13
2,"(1800.0, 2700.0]",6
3,"(2700.0, 3600.0]",4
4,"(5400.0, 6300.0]",3
5,"(6300.0, 7200.0]",2
6,"(3600.0, 4500.0]",2
7,"(8100.0, 9000.0]",1
8,"(7200.0, 8100.0]",1
9,"(4500.0, 5400.0]",1


Unnamed: 0,magnitude,counts
0,"(-200.0, 20000.0]",110
1,"(180000.0, 200000.0]",1
2,"(160000.0, 180000.0]",0
3,"(140000.0, 160000.0]",0
4,"(120000.0, 140000.0]",0
5,"(100000.0, 120000.0]",0
6,"(80000.0, 100000.0]",0
7,"(60000.0, 80000.0]",0
8,"(40000.0, 60000.0]",0
9,"(20000.0, 40000.0]",0


Unnamed: 0,magnitude,counts
0,"(-850.0, 85000.0]",292
1,"(765000.0, 850000.0]",1
2,"(680000.0, 765000.0]",0
3,"(595000.0, 680000.0]",0
4,"(510000.0, 595000.0]",0
5,"(425000.0, 510000.0]",0
6,"(340000.0, 425000.0]",0
7,"(255000.0, 340000.0]",0
8,"(170000.0, 255000.0]",0
9,"(85000.0, 170000.0]",0


Unnamed: 0,magnitude,counts
0,"(-4.5, 450.0]",74
1,"(450.0, 900.0]",7
2,"(4050.0, 4500.0]",2
3,"(900.0, 1350.0]",2
4,"(1800.0, 2250.0]",1
5,"(1350.0, 1800.0]",1
6,"(3600.0, 4050.0]",0
7,"(3150.0, 3600.0]",0
8,"(2700.0, 3150.0]",0
9,"(2250.0, 2700.0]",0


Unnamed: 0,magnitude,counts
0,"(85.5, 149.0]",48
1,"(149.0, 212.5]",40
2,"(22.0, 85.5]",40
3,"(-41.5, 22.0]",34
4,"(212.5, 276.0]",24
5,"(276.0, 339.5]",20
6,"(339.5, 403.0]",10
7,"(-105.635, -41.5]",9
8,"(403.0, 466.5]",8
9,"(466.5, 530.0]",3
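
The tables above list the bins in descending order of count, because `value_counts()` sorts by frequency. That makes the shape of each histogram hard to read. A minimal sketch, using made-up magnitudes rather than the Wikidata values, of the same `pd.cut` binning with the bins kept in numeric order:

```python
import pandas as pd

# Made-up magnitudes for illustration; not taken from the Wikidata subset.
magnitudes = pd.Series([1, 2, 3, 50, 98, 100], name='magnitude')

# Split the value range into 10 equal-width intervals, as the notebook does.
binned = pd.cut(magnitudes, bins=10)

# value_counts() orders bins by frequency; sort_index() restores interval order.
counts = (
    binned.value_counts()
    .sort_index()
    .rename_axis('magnitude')
    .reset_index(name='counts')
)
print(counts)
```

Keeping the interval order is only a presentation choice; the counts per bin are unchanged.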


## 2.3 Item properties
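### Top K qnodes used in node2
### qnode, count, label

## Header

### qnode -- qnode that appears as node2
### count -- number of statements in which the qnode appears as node2
### label -- English label of the qnode (the query returns it without an alias, so the output column is named node2)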



In [50]:
cmd  = "$kgtk query  -i $WIKIDATA_PARTS/$wikibase_item -i $WIKIDATA_PARTS/$label --graph-cache $STORE \
-o $OUTPUT_FOLDER/$item_properties \
--match 'part: (n1)-[l]->(n2), label: (n2)-[:label]->(label)' \
--return 'distinct n2 as qnode, count(n2) as count, label ' \
--where '(label.kgtk_lqstring_lang_suffix = \"en\") AND (n2 != \"__subset_name\")' \
--order-by 'count(n2) desc' \
--limit $K"
run_command(cmd, {"__subset_name":subset_name})

$kgtk query  -i $WIKIDATA_PARTS/$wikibase_item -i $WIKIDATA_PARTS/$label --graph-cache $STORE --match 'part: (n1)-[l]->(n2), label: (n2)-[:label]->(label)' --return 'distinct n2 as qnode, count(n2) as count, label ' --where '(label.kgtk_lqstring_lang_suffix = "en") AND (n2 != "Q11173")' --order-by 'count(n2) desc' --limit $K
qnode	count	node2
Q90481889	49433	'CAS COVID-19 Anti-Viral Candidate Compounds'@en
Q623	6174	'carbon'@en
Q629	5434	'oxygen'@en
Q12140	5048	'medication'@en
Q627	2418	'nitrogen'@en
Q556	2302	'hydrogen'@en
Q15978631	2146	'Homo sapiens'@en
Q51139288	1305	'NFPA 704: Standard System for the Identification of the Hazards of Materials for Emergency Response'@en
Q682	1246	'sulfur'@en
Q187661	1196	'carcinogen'@en

[2020-10-13 13:01:36 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_1_c1."node2" "qnode", count(graph_1_c1."node2") "count", graph_2_c2."node2"
     FROM graph_1 AS graph_1_c1, graph_2 AS graph_2_c2
     WHERE graph_2

In [13]:
df_item = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv('item_properties')),delimiter='\t')
df_item

Unnamed: 0,qnode,count,node2
0,Q90481889,49433,'CAS COVID-19 Anti-Viral Candidate Compounds'@en
1,Q623,6174,'carbon'@en
2,Q629,5434,'oxygen'@en
3,Q12140,5048,'medication'@en
4,Q627,2418,'nitrogen'@en
5,Q556,2302,'hydrogen'@en
6,Q15978631,2146,'Homo sapiens'@en
7,Q51139288,1305,'NFPA 704: Standard System for the Identificat...
8,Q682,1246,'sulfur'@en
9,Q187661,1196,'carcinogen'@en
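
The label column holds KGTK language-qualified strings such as `'carbon'@en`. For charts or joins it is often handy to split these into text and language. A rough sketch, assuming the `'text'@lang` shape seen in the output above; `split_lqstring` is a hypothetical helper, not a KGTK function, and it makes no attempt to unescape quotes inside the text:

```python
import re

# Assumed shape: single-quoted text, then '@', then a language tag.
LQSTRING = re.compile(r"^'(?P<text>.*)'@(?P<lang>[A-Za-z0-9-]+)$", re.DOTALL)


def split_lqstring(value):
    """Return (text, language); values that do not match come back unchanged."""
    match = LQSTRING.match(value)
    if match is None:
        return value, None
    return match.group('text'), match.group('lang')


print(split_lqstring("'carbon'@en"))
print(split_lqstring("Q623"))
```

Values that do not look like language-qualified strings are passed through, so the helper can be applied to a whole column without raising.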


## 2.4 Time properties
### min time, max time
### Chart: x axis is different times, y axis is count of nodes that have the value. Binning may be required


In [51]:
!$kgtk query  -i $WIKIDATA_PARTS/$time  --graph-cache $STORE \
-o $OUTPUT_FOLDER/$time_properties \
--match 'part: (n1)-[l]->(time)' \
--return 'min(time) as min_time, max(time) as max_time'

[2020-10-13 13:01:52 query]: SQL Translation:
---------------------------------------------
  SELECT min(graph_5_c1."node2") "min_time", max(graph_5_c1."node2") "max_time"
     FROM graph_5 AS graph_5_c1
  PARAS: []
---------------------------------------------
min_time	max_time
^1669-00-00T00:00:00Z/9	^2013-05-00T00:00:00Z/10
        0.55 real         0.43 user         0.11 sys


## Counts of the top K distinct time values

## Header

### time -- time value
### count -- number of statements with that time value



In [52]:
!$kgtk query  -i $WIKIDATA_PARTS/$time  --graph-cache $STORE \
-o $OUTPUT_FOLDER/$time_properties_count \
--match 'part: (n1)-[l]->(time)' \
--return 'distinct time as `time`, count(time) as count' \
--order-by 'count(time) desc' \
--limit $K

[2020-10-13 13:02:01 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_5_c1."node2" "time", count(graph_5_c1."node2") "count"
     FROM graph_5 AS graph_5_c1
     GROUP BY time
     ORDER BY count(graph_5_c1."node2") DESC
     LIMIT ?
  PARAS: [10]
---------------------------------------------
time	count
^1856-00-00T00:00:00Z/9	2
^1847-00-00T00:00:00Z/9	2
^2013-05-00T00:00:00Z/10	1
^1993-00-00T00:00:00Z/9	1
^1987-00-00T00:00:00Z/9	1
^1984-00-00T00:00:00Z/9	1
^1970-01-01T00:00:00Z/9	1
^1970-00-00T00:00:00Z/8	1
^1963-00-00T00:00:00Z/9	1
^1961-01-01T00:00:00Z/9	1
        0.56 real         0.44 user         0.11 sys


In [73]:
df_time = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv('time_properties_count')),delimiter='\t')
display(df_time)

Unnamed: 0,time,count
0,^1856-00-00T00:00:00Z/9,2
1,^1847-00-00T00:00:00Z/9,2
2,^2013-05-00T00:00:00Z/10,1
3,^1993-00-00T00:00:00Z/9,1
4,^1987-00-00T00:00:00Z/9,1
5,^1984-00-00T00:00:00Z/9,1
6,^1970-01-01T00:00:00Z/9,1
7,^1970-00-00T00:00:00Z/8,1
8,^1963-00-00T00:00:00Z/9,1
9,^1961-01-01T00:00:00Z/9,1
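
The section header notes that binning may be required before charting. One wrinkle is that values such as `^1856-00-00T00:00:00Z/9` use `00` for month and day, which Python's `datetime` rejects. A minimal sketch that pulls out the year with a regular expression and bins the counts shown above by decade; `year_of` and `bin_by_decade` are hypothetical helpers, and the pattern assumes the `^YYYY-MM-DDThh:mm:ssZ/precision` shape seen in the output:

```python
import re
from collections import Counter

# Assumed shape: '^', signed year, month, day, time part, optional '/precision'.
TIME_PATTERN = re.compile(r'^\^(-?\d{4,})-(\d{2})-(\d{2})T[^/]*(?:/(\d+))?$')


def year_of(kgtk_time):
    """Extract the year from a KGTK time literal."""
    match = TIME_PATTERN.match(kgtk_time)
    if match is None:
        raise ValueError(f'unrecognised time literal: {kgtk_time!r}')
    return int(match.group(1))


def bin_by_decade(rows):
    """Sum counts per decade for (time, count) pairs."""
    bins = Counter()
    for value, count in rows:
        bins[year_of(value) // 10 * 10] += count
    return dict(sorted(bins.items()))


# The (time, count) rows from the table above.
rows = [
    ('^1856-00-00T00:00:00Z/9', 2), ('^1847-00-00T00:00:00Z/9', 2),
    ('^2013-05-00T00:00:00Z/10', 1), ('^1993-00-00T00:00:00Z/9', 1),
    ('^1987-00-00T00:00:00Z/9', 1), ('^1984-00-00T00:00:00Z/9', 1),
    ('^1970-01-01T00:00:00Z/9', 1), ('^1970-00-00T00:00:00Z/8', 1),
    ('^1963-00-00T00:00:00Z/9', 1), ('^1961-01-01T00:00:00Z/9', 1),
]
print(bin_by_decade(rows))
```

Decade bins are one arbitrary choice; the same helper works for centuries by replacing 10 with 100.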


## 2.5 Quantity properties
### list of units and counts
### Chart: x axis is different magnitudes, y axis is count of nodes that have the value. Binning is required

## Header
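### units -- English label of the quantity property (for example 'mass'); despite the column name, these are property labels rather than measurement units
### counts -- number of statements that use that property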


In [53]:
!$kgtk query  -i $WIKIDATA_PARTS/$quantity -i $WIKIDATA_PARTS/$label --graph-cache $STORE \
-o $OUTPUT_FOLDER/$quantity_properties \
--match 'part: (n1)-[l{label: llab}]->(n2), label: (llab)-[:label]->(label)' \
--return 'distinct label as units, count(llab) as counts' \
--where 'label.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(llab) desc' \
--limit $K

[2020-10-13 13:02:09 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_2_c2."node2" "units", count(graph_8_c1."label") "counts"
     FROM graph_2 AS graph_2_c2, graph_8 AS graph_8_c1
     WHERE graph_2_c2."label"=?
     AND graph_8_c1."label"=graph_8_c1."label"
     AND graph_2_c2."node1"=graph_8_c1."label"
     AND (kgtk_lqstring_lang_suffix(graph_2_c2."node2") = ?)
     GROUP BY units
     ORDER BY count(graph_8_c1."label") DESC
     LIMIT ?
  PARAS: ['label', 'en', 10]
---------------------------------------------
units	counts
'mass'@en	146546
'melting point'@en	9494
'density'@en	1113
'defined daily dose'@en	908
'boiling point'@en	873
'time-weighted average exposure limit'@en	715
'vapor pressure'@en	605
'median lethal dose (LD50)'@en	477
'solubility'@en	433
'flash point'@en	405
        1.63 real         1.37 user         0.13 sys


In [76]:
df_quantity = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv('quantity_properties')),delimiter='\t')
display(df_quantity)

Unnamed: 0,units,counts
0,'mass'@en,146546
1,'melting point'@en,9494
2,'density'@en,1113
3,'defined daily dose'@en,908
4,'boiling point'@en,873
5,'time-weighted average exposure limit'@en,715
6,'vapor pressure'@en,605
7,'median lethal dose (LD50)'@en,477
8,'solubility'@en,433
9,'flash point'@en,405
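
To make the logic of the query concrete, here is a rough pandas analogue on a tiny invented edge table: join each quantity edge to the English label of its property, then count edges per label. The node ids and quantity values are placeholders, and this is only a sketch of the idea, not how KGTK evaluates the query:

```python
import pandas as pd

# Invented quantity edges (node1, label, node2); the values are placeholders.
edges = pd.DataFrame({
    'node1': ['Q_a', 'Q_b', 'Q_c'],
    'label': ['P2067', 'P2067', 'P2101'],
    'node2': ['+18', '+44', '+0'],
})

# Invented label edges, including one non-English label that must be filtered out.
labels = pd.DataFrame({
    'node1': ['P2067', 'P2067', 'P2101'],
    'label': ['label', 'label', 'label'],
    'node2': ["'mass'@en", "'Masse'@de", "'melting point'@en"],
})

# Keep English labels only, mirroring the --where clause.
english = labels[labels['node2'].str.endswith('@en')]

# Join on the edge's property id, then count edges per property label.
merged = edges.merge(english, left_on='label', right_on='node1',
                     suffixes=('', '_label'))
counts = (
    merged.groupby('node2_label').size()
    .sort_values(ascending=False)
    .rename_axis('units')
    .reset_index(name='counts')
)
print(counts)
```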


## 2.6 String, Monolingual text, URL, External id, Math, Musical notation
### number of distinct values
### list of top M values

## Header
### value -- value
### number_of_statements -- number of statements corresponding to that value.



In [54]:
types = [
    ("math","distinct_values_math"),
    ("string","distinct_values_string"),
    ("external_id","distinct_values_external_id"),
    ("monolingualtext","distinct_values_monolingual_text"),
    ("musical_notation","distinct_values_musical_notation"),
]

cmd = "$kgtk query  -i $WIKIDATA_PARTS/$TYPE_FILE  --graph-cache $STORE \
-o $OUTPUT_FOLDER/$output_file \
--match 'part: (n1)-[l]->(n2)' \
--return 'distinct n2 as value, count(n2) as number_of_statements' \
--order-by 'count(n2) desc' \
--limit $K"

# Use type_file rather than type so the Python built-in is not shadowed.
for type_file, output_file in types:
    run_command(cmd, {"TYPE_FILE": type_file, "output_file": output_file})

$kgtk query  -i $WIKIDATA_PARTS/$math  --graph-cache $STORE --match 'part: (n1)-[l]->(n2)' --return 'distinct n2 as value, count(n2) as number_of_statements' --order-by 'count(n2) desc' --limit $K
value	number_of_statements

[2020-10-13 13:02:23 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_6_c1."node2" "value", count(graph_6_c1."node2") "number_of_statements"
     FROM graph_6 AS graph_6_c1
     GROUP BY value
     ORDER BY count(graph_6_c1."node2") DESC
     LIMIT ?
  PARAS: [10]
---------------------------------------------
        0.61 real         0.47 user         0.12 sys

$kgtk query  -i $WIKIDATA_PARTS/$string  --graph-cache $STORE --match 'part: (n1)-[l]->(n2)' --return 'distinct n2 as value, count(n2) as number_of_statements' --order-by 'count(n2) desc' --limit $K
value	number_of_statements
"NA"	890
"C₆₁H₁₀₆O₆"	247
"C₆₁H₁₀₄O₆"	245
"C₆₃H₁₁₀O₆"	239
"C₆₁H₁₀₂O₆"	236
"C₆₃H₁₀₈O₆"	235
"C₅₉H₁₀₂O₆"	233
"C₁₅H₂₄"	227
"C₆₁H₁₀₈O₆"	225
"C₆₃

In [78]:
types = [
    ("math","distinct_values_math"),
    ("string","distinct_values_string"),
    ("external_id","distinct_values_external_id"),
    ("monolingualtext","distinct_values_monolingual_text"),
    ("musical_notation","distinct_values_musical_notation"),
]
df_distinct_values = []
for _, output_file in types:
    try:
        temp = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'), os.getenv(output_file)), delimiter='\t')
    except Exception as e:
        # Skip data types whose output file is missing or unreadable, but report why.
        print(f"Skipping {output_file}: {e}")
        continue
    df_distinct_values.append(temp)
    display(temp)

Unnamed: 0,value,number_of_statements


Unnamed: 0,value,number_of_statements
0,,890
1,C₆₁H₁₀₆O₆,247
2,C₆₁H₁₀₄O₆,245
3,C₆₃H₁₁₀O₆,239
4,C₆₁H₁₀₂O₆,236
5,C₆₃H₁₀₈O₆,235
6,C₅₉H₁₀₂O₆,233
7,C₁₅H₂₄,227
8,C₆₁H₁₀₈O₆,225
9,C₆₃H₁₀₆O₆,222


Unnamed: 0,value,number_of_statements
0,novalue,32
1,3077,24
2,1993,20
3,5J7L,19
4,5J5B,19
5,5JC9,18
6,5J91,18
7,5IT8,18
8,4YBB,18
9,2811,18


Unnamed: 0,value,number_of_statements
0,'zopiclone'@en,2
1,'tomoxetine'@en,2
2,'sulfamoxole'@en,2
3,'pembrolizumab'@en,2
4,'doxycycline'@en,2
5,'chlorphenesin'@en,2
6,'bendroflumethiazide'@en,2
7,'azelastine'@en,2
8,'atomoxetine'@en,2
9,'atenolol'@en,2


Unnamed: 0,value,number_of_statements


## 2.7 Globe coordinate
### map with top M, put circles with radius proportional to the number of nodes that have the coordinate 

## Header
### Coordinate -- Coordinate
### Count -- Number of statements with that coordinate (the query does not alias this column, so the output file names it count(graph_12_c1."node2"))


In [55]:
!$kgtk query  -i $WIKIDATA_PARTS/$globe_coordinate -i $WIKIDATA_PARTS/$label --graph-cache $STORE \
-o $OUTPUT_FOLDER/$globe_coordinate_top_m \
--match 'part: (n1)-[l]->(n2)' \
--return 'distinct n2 as Coordinate, count(n2)' \
--order-by 'count(n2) desc' \
--limit $K

[2020-10-13 13:02:57 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_12_c1."node2" "Coordinate", count(graph_12_c1."node2")
     FROM graph_12 AS graph_12_c1
     GROUP BY Coordinate
     ORDER BY count(graph_12_c1."node2") DESC
     LIMIT ?
  PARAS: [10]
---------------------------------------------
Coordinate	count(graph_12_c1."node2")
@37.25/27	1
        0.55 real         0.43 user         0.10 sys


In [79]:
df_coordinate = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv('globe_coordinate_top_m')),delimiter='\t')
display(df_coordinate)

Unnamed: 0,Coordinate,"count(graph_12_c1.""node2"")"
0,@37.25/27,1
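
To place the top coordinates on a map, the coordinate literal first has to be split into numbers. A minimal sketch, assuming the `@latitude/longitude` shape of the single value above; `parse_coordinate` is a hypothetical helper, not part of KGTK:

```python
def parse_coordinate(value):
    """Split a literal such as '@37.25/27' into (latitude, longitude) floats."""
    if not value.startswith('@'):
        raise ValueError(f'not a coordinate literal: {value!r}')
    latitude, longitude = value[1:].split('/')
    return float(latitude), float(longitude)


print(parse_coordinate('@37.25/27'))
```

The resulting pair can be combined with the count column to size the circles, using whichever mapping library is at hand.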


## 2.8 geo-shape
### random sample of M nodes that have the property

## Header
### node1 -- node1
### label -- label
### node2 -- node2


In [80]:
try:
    df = pd.read_csv(os.path.join(os.getenv('WIKIDATA_PARTS'), os.getenv('geo_shape')), delimiter='\t', index_col=False)
    # K may arrive from papermill wrapped in quotes (for example '"10"'), so strip them before converting.
    num_rows = min(int(str(K).strip('"\'')), len(df))
    # DataFrame.sample returns a new frame, so keep the result instead of discarding it.
    df = df.sample(n=num_rows)
    display(df)
    df.to_csv(os.path.join(os.getenv('OUTPUT_FOLDER'), os.getenv('geo_shape_top_m')), sep='\t', index=False)
except Exception as e:
    print(e)

Unnamed: 0,id,node1,label,node2
