# Generating statistics for subset of Wikidata

This notebook illustrates how to generate statistics for a subset of Wikidata. \
We use as an example https://www.wikidata.org/wiki/Q11173 (chemical compound).

Example Dataset wikidata subset: https://drive.google.com/drive/u/1/folders/1KjNwV5M2G3JzCrPgqk_TSx8wTE49O2Sx \
Example Dataset statistics: https://drive.google.com/drive/u/1/folders/1_4Mxd0MAo0l9aR3aInv0YMTJrtneh7HW 

### Example Invocation command

    papermill /Users/shashanksaurabh/Desktop/MS/ISI/isi/kgtk_shashank73744/kgtk/examples/Example_9-Wikidata_Subset_Statistics_.ipynb \
    /Users/shashanksaurabh/Desktop/MS/ISI/isi/kgtk_shashank73744/kgtk/examples/Example_9_output.ipynb \
    -p wikidata_home '/Users/shashanksaurabh/Desktop/Data_isi' \
    -p wikidata_parts_folder '/Users/shashanksaurabh/Desktop/Data_isi/Chemical' \
    -p cache_folder '/Users/shashanksaurabh/Desktop/Data_isi/Temp' \
    -p output_folder '/Users/shashanksaurabh/Desktop/Data_isi/output' \
    -p delete_database 'yes' \
    -p K \"10\" \
    -p subset_name 'Q11173'


# Naming Convention:

## [subset_name].[section].[brief_description].tsv

## subset_name 
It is the qnode corresponding to the  wikidata subset, for example if the wikidata subset refers to the chemical compounds then “subset_name” is ”Q11173”

## section 
1. It is the section corresponding to the “Knowledge Graph Statistics” described above. For example if its class summary then it would be ‘1’. 

2. If there are more than one subsection for the main section then the subsection would be added after a dot(‘.’). For example ‘2.2’ corresponds to the ‘item properties’ under ‘Properties’. 

3. If there are more than one subsection to the parent subsection then the subsections would be recursively added after a dot(‘.’). For example while calculating the statistics of string for the section 2.6, it would be named 2.6.1

## brief_description:
1. Its a brief description just after section 
2. For class summary it’s ‘class_summary’
3. For ‘examples’ it’s the qnode for which 3 examples are shown
4. Its ‘property_summary_[type]’ for each of the data types in section 2.1.
5. Its ‘item_properties’ for section 2.3
6. Its ‘time_properties’ for section 2.4
7. Its ‘quantity_properties’ for section 2.5
8. Its ‘distincs_values_[type]’ for each of the type for section 2.6
9. Its ‘geo_coordinate_top_m’ for 2.7
10. Its ‘geo_shape_top_m’ for for 2.8


In [70]:
wikidata_home = "/Users/shashanksaurabh/Desktop/Data_isi"
# path to folder which contains all files corresponding to the wikidata subset. 
#(For more information on wikidata subset please check Example 8)
wikidata_parts_folder = "/Users/shashanksaurabh/Desktop/Data_isi/Chemical"
# The notebook creates a cache, which stores in the cache_folder. The cache can be deleted after the execution.
cache_folder = "/Users/shashanksaurabh/Desktop/Data_isi/Temp"
# path to the folder where the output (here statistics) would be stored
output_folder = "/Users/shashanksaurabh/Desktop/Data_isi/output"
# delete_database = "yes"
# The statistics also uses the consolidated kgtk file with all the properties and items. This is the path to consolidated file
# In each of statistics top K results are chosen. In the following examples this has been implemented using the --limit attribute.
K = "10"
subset_name = "Q11173"

In [71]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd

from IPython.display import display, HTML

# import altair as alt
# alt.renderers.enable('altair_viewer')

# from IPython.display import display, HTML, Image
# from pandas_profiling import ProfileReport

### Set up environment variables and folders that we need

In [72]:
# path to folder which contains all files corresponding to the wikidata subset. 
#(For more information on wikidata subset please check Example 8)
os.environ['WIKIDATA_PARTS'] = wikidata_parts_folder
# path to the folder where the output (here statistics) would be stored
os.environ['OUTPUT_FOLDER'] = output_folder
# kgtk command to run
os.environ['kgtk'] = "kgtk"
os.environ['kgtk'] = "time kgtk --debug"
# absolute path of the db
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(cache_folder)
os.environ['K'] = K
os.environ['label'] = subset_name + ".label.en.tsv.gz"
# file name corresponding to different part of the subsets.
os.environ['subset_name']  = subset_name
os.environ['external_id']  = subset_name + ".part.external-id.tsv.gz"
os.environ['time']  = subset_name +  ".part.time.tsv.gz "
os.environ['wikibase_item']  = subset_name + ".part.wikibase-item.tsv.gz"
os.environ['quantity']  = subset_name +  ".part.quantity.tsv.gz"
os.environ['statistics']  = subset_name + ".statistics.tsv.gz"
os.environ['wikibase_form']  = subset_name + ".part.wikibase-form.tsv.gz"
os.environ['monolingualtext']  = subset_name + ".part.monolingualtext.tsv.gz"
os.environ['math']  = subset_name + ".part.math.tsv.gz"
os.environ['commonsMedia']  = subset_name + ".part.commonsMedia.tsv.gz"
os.environ['globe_coordinate']  = subset_name + ".part.globe-coordinate.tsv.gz"
os.environ['musical_notation']  = subset_name + ".part.musical-notation.tsv.gz"
os.environ['geo_shape']  = subset_name + ".part.geo-shape.tsv.gz"
os.environ['url']  = subset_name + ".part.url.tsv.gz"
os.environ['string']  = subset_name + ".part.string.tsv.gz"
# Output file for statistics 1.1
os.environ['class_summary']  = subset_name + ".1.class_summary.tsv"
# Output files for statistics 2.1
os.environ['property_summary']  = subset_name + ".2.1.property_summary.tsv"
# Output files for statistics 2.2
os.environ['property_summary_external_id']  = subset_name + ".2.2.1property_summary_external_id.tsv"
os.environ['property_summary_time']  = subset_name + ".2.2.2.property_summary_time.tsv"
os.environ['property_summary_wikibase_item']  = subset_name + ".2.2.3.property_summary_wikibase_item.tsv"
os.environ['property_summary_quantity']  = subset_name + ".2.2.4.property_summary_quantity.tsv"
os.environ['property_summary_wikibase_form']  = subset_name + ".2.2.5.property_summary_wikibase_form.tsv"
os.environ['property_summary_monolingualtext']  = subset_name + ".2.2.6.property_summary_monolingualtext.tsv"
os.environ['property_summary_math']  = subset_name + ".2.2.7property_summary_math.tsv"
os.environ['property_summary_commonsMedia']  = subset_name + ".2.2.8.property_summary_commonsMedia.tsv"
os.environ['property_summary_globe_coordinate']  = subset_name + ".2.2.9.property_summary_globe_coordinate.tsv"
os.environ['property_summary_musical_notation']  = subset_name + ".2.2.10.property_summary_musical_notation.tsv"
os.environ['property_summary_geo_shape']  = subset_name + ".2.2.11.property_summary_geo_shape.tsv"
os.environ['property_summary_url']  = subset_name + ".2.2.12.property_summary_url.tsv"
os.environ['property_summary_string']  = subset_name + ".2.2.13.property_summary_string.tsv"
# Output files for statistics 2.3
os.environ['item_properties']  = subset_name + ".2.3.item_properties.tsv"# Output files for statistics 3.1
# Output files for statistics 2.4
os.environ['time_properties']  = subset_name + ".2.4.1.time_properties.tsv"
os.environ['time_properties_count']  = subset_name + ".2.4.2.time_properties.tsv"
# Output files for statistics 2.5
os.environ['quantity_properties']  = subset_name + ".2.5.quantity_properties.tsv"
# Output files for statistics 2.6
os.environ['distinct_values_math']  = subset_name + ".2.6.1.distinct_values_math.tsv"
os.environ['distinct_values_string']  = subset_name + ".2.6.2.distinct_values_string.tsv"
os.environ['distinct_values_external_id']  = subset_name + ".2.6.3.distinct_values_external_id.tsv"
os.environ['distinct_values_monolingual_text']  = subset_name + ".2.6.4.distinct_values_monolingual_text.tsv"
os.environ['distinct_values_musical_notation']  = subset_name + ".2.6.5.distinct_values_musical_notation.tsv"
# Output files for statistics 2.7
os.environ['globe_coordinate_top_m']  = subset_name + ".2.7.globe_coordinate_top_m.tsv"
# Output files for statistics 2.8
os.environ['geo_shape_top_m']  = subset_name + ".2.8.geo_shape_top_m.tsv"

In [73]:
def run_command(cmd, substitution_dictionary = {}):
    """Run a templetized command."""
    for k, v in substitution_dictionary.items():
        cmd = cmd.replace(k, v)
    
    print(cmd)
    output = subprocess.run([cmd], shell=True, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    print(output.stdout)
    print(output.stderr)
    #print(output.returncode)

In [74]:
def find_statistics(df,column):
    n = df[column].count()
    print("Statistics for : "+column)
    unique = df[column].nunique()
    unique_percent = round((unique/n)*100,2)
    print("Distinct : "+str(unique))
    print("Distinct(%) :"+str(unique_percent)+"%")
    missing = df[column].isnull().sum()
    missing_percent = round((missing/n)*100,2)
    print("Missing : "+str(missing))
    print("Missing(%) : "+str(missing_percent)+"%")
    infinite = df[column].isna().sum()
    infinite_percent = round((infinite/n)*100,2)
    print("Infinite : "+str(infinite))
    print("Infinite(%) : "+str(infinite_percent)+"%")
    mean = df[column].mean()
    print("Mean : "+str(mean))
    max_ = df[column].max()
    print("Max : "+str(max_))
    min_ = df[column].min()
    print("Min : "+str(min_))
    zeros_ = (df[column]==0).sum()
    zeros_percent = round((zeros_/n)*100,2)
    print("Zeros : "+str(zeros_))
    print("Zeros(%) : "+str(unique_percent)+"%")
    

# 1. Classes
   ## 1.1 Class summary
   ### List of top K classes based on number of instances (Done)
   ### class -- is the qnode corresponding to the class 
   ### pnode -- is the property by which the class is linked
   ### name -- is the label for the class
   ### number of instances -- is the number of instances of the class in wikidata subset
   ### pagerank -- is the page rank of class


In [75]:
cmd = "$kgtk query -i $WIKIDATA_PARTS/$wikibase_item -i $WIKIDATA_PARTS/$label -i $WIKIDATA_PARTS/$statistics --graph-cache $STORE \
-o $OUTPUT_FOLDER/$class_summary \
--match 'item: (n1)-[l{label:llab}]->(n2), label: (n2)-[:label]->(label),statistics:(n2)-[:vertex_pagerank]->(pagerank) ' \
--return 'distinct n2 as class, label as name, count(n2) as value_count, pagerank as pagerank' \
--where '(label.kgtk_lqstring_lang_suffix = \"en\") AND (llab IN [\"P31\" , \"P279\"]) AND (n2 != \"__subset_name\") ' \
--order-by 'count(n2) desc'"

run_command(cmd, {"__subset_name": subset_name})

$kgtk query -i $WIKIDATA_PARTS/$wikibase_item -i $WIKIDATA_PARTS/$label -i $WIKIDATA_PARTS/$statistics --graph-cache $STORE -o $OUTPUT_FOLDER/$class_summary --match 'item: (n1)-[l{label:llab}]->(n2), label: (n2)-[:label]->(label),statistics:(n2)-[:vertex_pagerank]->(pagerank) ' --return 'distinct n2 as class, label as name, count(n2) as value_count, pagerank as pagerank' --where '(label.kgtk_lqstring_lang_suffix = "en") AND (llab IN ["P31" , "P279"]) AND (n2 != "Q11173") ' --order-by 'count(n2) desc'

[2020-10-20 13:01:31 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_20_c2."node1" "class", graph_20_c2."node2" "name", count(graph_20_c2."node1") "value_count", graph_21_c3."node2" "pagerank"
     FROM graph_19 AS graph_19_c1, graph_20 AS graph_20_c2, graph_21 AS graph_21_c3
     WHERE graph_19_c1."label"=graph_19_c1."label"
     AND graph_20_c2."label"=?
     AND graph_21_c3."label"=?
     AND graph_19_c1."node2"=graph_20_c2."node1"
     AN

In [76]:
df_class_summary = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv('class_summary')),delimiter='\t')
df_class_summary.head()

Unnamed: 0,class,name,value_count,pagerank
0,Q12140,'medication'@en,5044,0.000111
1,Q63436503,'diacylglycerophosphocholine'@en,970,2.5e-05
2,Q187661,'carcinogen'@en,958,2.7e-05
3,Q63446172,'wax monoester'@en,820,2.1e-05
4,Q11367,'lipid'@en,704,2.2e-05


## Statistics for 'value_count' column

In [77]:
find_statistics(df_class_summary,"value_count")

Statistics for : value_count
Distinct : 106
Distinct(%) :4.63%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 17.351374945438675
Max : 5044
Min : 2
Zeros : 0
Zeros(%) : 4.63%


## Statistics for 'pagerank' column

In [27]:
find_statistics(df_class_summary,"pagerank")

Statistics for : pagerank
Distinct : 2288
Distinct(%) :99.87%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 1.5944703872654804e-06
Max : 0.00012463927532536587
Min : 9.33084945063929e-08
Zeros : 0
Zeros(%) : 99.87%


 ### Example wikibase item for each of the class.
 ### qnode -- is the qnode of the example wikibase item
 ### name -- is the label of qnode
 ### pagerank -- is the page rank of the qnode


In [31]:
try:
    df = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv('class_summary')),delimiter='\t')
    cmd = "$kgtk query -i $WIKIDATA_PARTS/$wikibase_item  -i $WIKIDATA_PARTS/$label -i $WIKIDATA_PARTS/$statistics --graph-cache $STORE \
    -o $OUTPUT_FOLDER/__output_file \
    --match 'item: (n1)-[l]->(n2:__class), label: (n1)-[:label]->(label), statistics:(n1)-[:vertex_pagerank]->(pagerank)' \
    --return 'n1 as qnode, label as name, pagerank as pagerank' \
    --where 'label.kgtk_lqstring_lang_suffix = \"en\"' \
    --order-by pagerank \
    --limit 3"
    for index, ele in df.iterrows():
        if index>10:
            break
        if ele["class"] ==  subset_name:
            continue
        print("Examples for ", ele["class"]," class")
        output_file = subset_name + ".1." + ele["class"]+".examples"+".tsv"
        run_command(cmd, {"__output_file": output_file,"__class": ele["class"]})
except Exception as e:
    print(e)

Examples for  Q12140  class
$kgtk query -i $WIKIDATA_PARTS/$wikibase_item  -i $WIKIDATA_PARTS/$label -i $WIKIDATA_PARTS/$statistics --graph-cache $STORE     -o $OUTPUT_FOLDER/Q11173.1.Q12140.examples.tsv     --match 'item: (n1)-[l]->(n2:Q12140), label: (n1)-[:label]->(label), statistics:(n1)-[:vertex_pagerank]->(pagerank)'     --return 'n1 as qnode, label as name, pagerank as pagerank'     --where 'label.kgtk_lqstring_lang_suffix = "en"'     --order-by pagerank     --limit 3

[2020-10-20 11:35:00 query]: SQL Translation:
---------------------------------------------
  SELECT graph_19_c1."node1" "qnode", graph_20_c2."node2" "name", graph_21_c3."node2" "pagerank"
     FROM graph_19 AS graph_19_c1, graph_20 AS graph_20_c2, graph_21 AS graph_21_c3
     WHERE graph_19_c1."node2"=?
     AND graph_20_c2."label"=?
     AND graph_21_c3."label"=?
     AND graph_19_c1."node1"=graph_20_c2."node1"
     AND graph_20_c2."node1"=graph_21_c3."node1"
     AND (kgtk_lqstring_lang_suffix(graph_20_c2."node


[2020-10-20 11:35:06 query]: SQL Translation:
---------------------------------------------
  SELECT graph_20_c2."node1" "qnode", graph_20_c2."node2" "name", graph_21_c3."node2" "pagerank"
     FROM graph_19 AS graph_19_c1, graph_20 AS graph_20_c2, graph_21 AS graph_21_c3
     WHERE graph_19_c1."node2"=?
     AND graph_20_c2."label"=?
     AND graph_21_c3."label"=?
     AND graph_19_c1."node1"=graph_20_c2."node1"
     AND graph_19_c1."node1"=graph_21_c3."node1"
     AND (kgtk_lqstring_lang_suffix(graph_20_c2."node2") = ?)
     ORDER BY graph_21_c3."node2" ASC
     LIMIT ?
  PARAS: ['Q193430', 'label', 'vertex_pagerank', 'en', 3]
---------------------------------------------
        0.68 real         0.50 user         0.13 sys

Examples for  Q422248  class
$kgtk query -i $WIKIDATA_PARTS/$wikibase_item  -i $WIKIDATA_PARTS/$label -i $WIKIDATA_PARTS/$statistics --graph-cache $STORE     -o $OUTPUT_FOLDER/Q11173.1.Q422248.examples.tsv     --match 'item: (n1)-[l]->(n2:Q422248), label: (n1)-[

In [33]:
df_class_example = []
try:
    df = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv('class_summary')),delimiter='\t')
    for index, ele in df.iterrows():
        if index>10:
            break
        if ele["class"] ==  subset_name:
            continue
        output_file = subset_name + ".1." + ele["class"]+".examples"+".tsv"
        df_class_example.append(pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),output_file),delimiter='\t'))
    for i in range(len(df_class_example)):
        print("Example for "+ele["class"])
        display(df_class_example[i])
        print("-------------------------")
except Exception as e:
    print(e)

Example for Q134856


Unnamed: 0,qnode,name,pagerank
0,Q21011228,'dulaglutide'@en,1e-06
1,Q21011228,'dulaglutide'@en,1e-06
2,Q27270940,'tezacaftor'@en,1e-06


-------------------------
Example for Q134856


Unnamed: 0,qnode,name,pagerank
0,Q27105002,"'1-hexadecanoyl-2-(9Z,12Z-octadecadienoyl)-sn-...",1e-06
1,Q27145082,'1-stearoyl-2-palmitoyl-sn-glycero-3-phosphoch...,1e-06
2,Q27145169,'1-palmitoyl-2-acetyl-sn-glycero-3-phosphochol...,1e-06


-------------------------
Example for Q134856


Unnamed: 0,qnode,name,pagerank
0,Q27289240,'bromodichloroacetic acid'@en,1e-06
1,Q2307855,'glycidaldehyde'@en,1e-06
2,Q27257137,'4-chloromethylbiphenyl'@en,1e-06


-------------------------
Example for Q134856


Unnamed: 0,qnode,name,pagerank
0,Q27269581,'ethyl-2-nonynoate'@en,1e-06
1,Q27268650,'cetyl myristate'@en,1e-06
2,Q27285351,'2-pentyl butyrate'@en,1e-06


-------------------------
Example for Q134856


Unnamed: 0,qnode,name,pagerank
0,Q37614694,'C16 DHLactosylceramide (incomplete stereochem...,1.586265e-06
1,Q37614694,'C16 DHLactosylceramide (incomplete stereochem...,1.586265e-06
2,Q37613317,'dihydroceramide-1-phosphorylcholine'@en,1.869609e-07


-------------------------
Example for Q134856


Unnamed: 0,qnode,name,pagerank
0,Q27105590,"'7,3\\\\\\\\'-dihydroxy-4\\\\\\\\'-methoxy-8-m...",1e-06
1,Q27103316,"'3\\\\\\\\',4\\\\\\\\',5,6-tetrahydroxy-3,7-di...",1e-06
2,Q7352935,'robinetinidol'@en,1e-06


-------------------------
Example for Q134856


Unnamed: 0,qnode,name,pagerank
0,Q420056,'Peginterferon alfa-2a'@en,1e-06
1,Q213511,'erythromycin'@en,1e-05
2,Q213511,'erythromycin'@en,1e-05


-------------------------
Example for Q134856


Unnamed: 0,qnode,name,pagerank
0,Q10322996,'mafosfamide'@en,1e-06
1,Q4492673,'metofenazate'@en,1e-06
2,Q4637151,'4-hydroperoxycyclophosphamide'@en,1e-06


-------------------------
Example for Q134856


Unnamed: 0,qnode,name,pagerank
0,Q3655009,'Canakinumab'@en,1e-06
1,Q7444755,'Secukinumab'@en,1e-06
2,Q410656,'Ofatumumab'@en,1e-06


-------------------------
Example for Q134856


Unnamed: 0,qnode,name,pagerank
0,Q27257146,'(3E)-3-octenoic acid'@en,1e-06
1,Q2707986,'paullinic acid'@en,1e-06
2,Q27277849,'4-ethyloctanoic acid'@en,1e-06


-------------------------
Example for Q134856


Unnamed: 0,qnode,name,pagerank
0,Q27098079,'icosanoyl-CoA'@en,1e-06
1,Q27098086,'pentanoyl-CoA'@en,1e-06
2,Q27117097,'heptanoyl-CoA'@en,1e-06


-------------------------


# 2. Property

# 2.1 First a summary by property type
## For each type
### pnode, label, count of items with property (see example below)
### Example (show an instance or two of this property)



In [34]:
types = [
    ("time","time","property_summary_time"),
    ("wikibase_item","wikibase_item","property_summary_wikibase_item"),
    ("math","math","property_summary_math"),
    ("wikibase_form","wikibase-form","property_summary_wikibase_form"),
    ("quantity","quantity","property_summary_quantity"),
    ("string","string","property_summary_string"),
    ("external_id","external-id","property_summary_external_id"),
    ("commonsMedia","commonsMedia","property_summary_commonsMedia"),
    ("globe_coordinate","globe-coordinate","property_summary_globe_coordinate"),
    ("monolingualtext","monolingualtext","property_summary_monolingualtext"),
    ("musical_notation","musical-notation","property_summary_musical_notation"),
    ("geo_shape","geo-shape","property_summary_geo_shape"),
    ("url","url","property_summary_url"),
]

cmd = "$kgtk query  -i $WIKIDATA_PARTS/$TYPE_FILE -i $WIKIDATA_PARTS/$label --graph-cache $STORE \
-o $OUTPUT_FOLDER/$output_file \
--match 'part: (n1)-[l{label: llab}]->(n2), label: (llab)-[:label]->(label)' \
--return 'distinct llab as property, count(llab) as number_of_statements, count(n2) as number_of_nodes_having_the_property, label as `label`' \
--where 'label.kgtk_lqstring_lang_suffix = \"en\"' \
--order-by 'count(llab) desc, count(n2) desc'"

for type,name, output_file in types:
    run_command(cmd, {"TYPE_FILE": type,"output_file":output_file})

$kgtk query  -i $WIKIDATA_PARTS/$time -i $WIKIDATA_PARTS/$label --graph-cache $STORE -o $OUTPUT_FOLDER/$property_summary_time --match 'part: (n1)-[l{label: llab}]->(n2), label: (llab)-[:label]->(label)' --return 'distinct llab as property, count(llab) as number_of_statements, count(n2) as number_of_nodes_having_the_property, label as `label`' --where 'label.kgtk_lqstring_lang_suffix = "en"' --order-by 'count(llab) desc, count(n2) desc'

[2020-10-20 11:43:36 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_17_c1."label" "property", count(graph_17_c1."label") "number_of_statements", count(graph_17_c1."node2") "number_of_nodes_having_the_property", graph_20_c2."node2" "label"
     FROM graph_17 AS graph_17_c1, graph_20 AS graph_20_c2
     WHERE graph_17_c1."label"=graph_17_c1."label"
     AND graph_20_c2."label"=?
     AND graph_17_c1."label"=graph_20_c2."node1"
     AND (kgtk_lqstring_lang_suffix(graph_20_c2."node2") = ?)
     GROUP BY proper


[2020-10-20 11:44:26 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_20_c2."node1" "property", count(graph_20_c2."node1") "number_of_statements", count(graph_25_c1."node2") "number_of_nodes_having_the_property", graph_20_c2."node2" "label"
     FROM graph_20 AS graph_20_c2, graph_25 AS graph_25_c1
     WHERE graph_20_c2."label"=?
     AND graph_25_c1."label"=graph_20_c2."node1"
     AND graph_20_c2."node1"=graph_25_c1."label"
     AND (kgtk_lqstring_lang_suffix(graph_20_c2."node2") = ?)
     GROUP BY property
     ORDER BY count(graph_20_c2."node1") DESC, count(graph_25_c1."node2") DESC
  PARAS: ['label', 'en']
---------------------------------------------
        1.54 real         0.80 user         0.21 sys

$kgtk query  -i $WIKIDATA_PARTS/$globe_coordinate -i $WIKIDATA_PARTS/$label --graph-cache $STORE -o $OUTPUT_FOLDER/$property_summary_globe_coordinate --match 'part: (n1)-[l{label: llab}]->(n2), label: (llab)-[:label]->(label)' --retu

In [39]:
df_property_summary = []
types = [
    ("time","time","property_summary_time"),
    ("wikibase_item","wikibase_item","property_summary_wikibase_item"),
    ("math","math","property_summary_math"),
    ("wikibase_form","wikibase-form","property_summary_wikibase_form"),
    ("quantity","quantity","property_summary_quantity"),
    ("string","string","property_summary_string"),
    ("external_id","external-id","property_summary_external_id"),
    ("commonsMedia","commonsMedia","property_summary_commonsMedia"),
    ("globe_coordinate","globe-coordinate","property_summary_globe_coordinate"),
    ("monolingualtext","monolingualtext","property_summary_monolingualtext"),
    ("musical_notation","musical-notation","property_summary_musical_notation"),
    ("geo_shape","geo-shape","property_summary_geo_shape"),
    ("url","url","property_summary_url"),
]
for type,name, output_file in types:
    try:
        temp = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv(output_file)),delimiter='\t')
    except Exception as e:
        continue
    df_property_summary.append(temp)
    if  len(df_property_summary[-1])>0:
        display(df_property_summary[-1].head())
        df_property_summary[-1].describe()

Unnamed: 0,property,number_of_statements,number_of_nodes_having_the_property,label
0,P575,20,20,'time of discovery or invention'@en
1,P2669,1,1,'discontinued date'@en
2,P571,1,1,'inception'@en
3,P729,1,1,'service entry'@en
4,P730,1,1,'service retirement'@en


Unnamed: 0,property,number_of_statements,number_of_nodes_having_the_property,label
0,P972,49434,49434,'catalog'@en
1,P527,15973,15973,'has part'@en
2,P361,9430,9430,'part of'@en
3,P2175,6121,6121,'medical condition treated'@en
4,P3780,4178,4178,'active ingredient in'@en


Unnamed: 0,property,number_of_statements,number_of_nodes_having_the_property,label
0,P2067,146546,146546,'mass'@en
1,P2101,9494,9494,'melting point'@en
2,P2054,1113,1113,'density'@en
3,P4250,908,908,'defined daily dose'@en
4,P2102,873,873,'boiling point'@en


Unnamed: 0,property,number_of_statements,number_of_nodes_having_the_property,label
0,P274,344854,344854,'chemical formula'@en
1,P233,199853,199853,'canonical SMILES'@en
2,P2017,141589,141589,'isomeric SMILES'@en
3,P373,3463,3463,'Commons category'@en
4,P1931,615,615,'NIOSH Pocket Guide ID'@en


Unnamed: 0,property,number_of_statements,number_of_nodes_having_the_property,label
0,P235,997389,997389,'InChIKey'@en
1,P234,990406,990406,'InChI'@en
2,P231,927660,927660,'CAS Registry Number'@en
3,P3117,848585,848585,'DSSTox substance ID'@en
4,P662,250629,250629,'PubChem CID'@en


Unnamed: 0,property,number_of_statements,number_of_nodes_having_the_property,label
0,P117,12520,12520,'chemical structure'@en
1,P18,1992,1992,'image'@en
2,P8224,20,20,'molecular model'@en
3,P443,17,17,'pronunciation audio'@en
4,P989,6,6,'spoken text audio'@en


Unnamed: 0,property,number_of_statements,number_of_nodes_having_the_property,label
0,P625,1,1,'coordinate location'@en


Unnamed: 0,property,number_of_statements,number_of_nodes_having_the_property,label
0,P2275,2414,2414,'World Health Organisation International Nonpr...
1,P1705,13,13,'native label'@en
2,P1813,9,9,'short name'@en
3,P2561,4,4,'name'@en
4,P1448,1,1,'official name'@en


Unnamed: 0,property,number_of_statements,number_of_nodes_having_the_property,label
0,P2888,41,41,'exact match'@en
1,P856,26,26,'official website'@en
2,P6363,4,4,'WordLift URL'@en
3,P1482,1,1,'Stack Exchange tag'@en
4,P1709,1,1,'equivalent class'@en


# 2.2 Divided by data type of property


## For each data type select top K properties based on number of statements we are finding the units and the value_counts of the unit

## Header
### units -- The qnode corresponding to the unit
### value_Count -- number of instances of the unit

In [40]:
df_quantity_data_type = []
try:
    df = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv('property_summary_quantity')),delimiter='\t')
    cmd = "$kgtk query -i $WIKIDATA_PARTS/$quantity --graph-cache $STORE \
            -o $OUTPUT_FOLDER/__output \
            --match  '(n1)-[r:__property]->(v)' \
            --return 'distinct kgtk_quantity_wd_units(v) as units, count(v) as value_count' \
            --where 'kgtk_quantity(v)' \
            --order-by 'count(v) desc' "
    for index, ele in df.iterrows():
        output_file = subset_name + ".2.2.5." + ele["property"]+".data_types"+".tsv"
        run_command(cmd, {"__property": ele["property"],"__output": output_file})
        temp = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),output_file),delimiter='\t')
        df_quantity_data_type.append(temp)
except Exception as e:
    print(e)

$kgtk query -i $WIKIDATA_PARTS/$quantity --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.5.P2067.data_types.tsv             --match  '(n1)-[r:P2067]->(v)'             --return 'distinct kgtk_quantity_wd_units(v) as units, count(v) as value_count'             --where 'kgtk_quantity(v)'             --order-by 'count(v) desc' 

[2020-10-20 11:50:35 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_quantity_wd_units(graph_16_c1."node2") "units", count(graph_16_c1."node2") "value_count"
     FROM graph_16 AS graph_16_c1
     WHERE graph_16_c1."label"=?
     AND kgtk_quantity(graph_16_c1."node2")
     GROUP BY units
     ORDER BY count(graph_16_c1."node2") DESC
  PARAS: ['P2067']
---------------------------------------------
        3.88 real         3.05 user         0.22 sys

$kgtk query -i $WIKIDATA_PARTS/$quantity --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.5.P2101.data_types.tsv             --match  '(n1)-[r:P21


[2020-10-20 11:50:45 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_quantity_wd_units(graph_16_c1."node2") "units", count(graph_16_c1."node2") "value_count"
     FROM graph_16 AS graph_16_c1
     WHERE graph_16_c1."label"=?
     AND kgtk_quantity(graph_16_c1."node2")
     GROUP BY units
     ORDER BY count(graph_16_c1."node2") DESC
  PARAS: ['P2129']
---------------------------------------------
        0.61 real         0.47 user         0.11 sys

$kgtk query -i $WIKIDATA_PARTS/$quantity --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.5.P2260.data_types.tsv             --match  '(n1)-[r:P2260]->(v)'             --return 'distinct kgtk_quantity_wd_units(v) as units, count(v) as value_count'             --where 'kgtk_quantity(v)'             --order-by 'count(v) desc' 

[2020-10-20 11:50:46 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_quantity_wd_units(graph_16_c1."node2") "uni


[2020-10-20 11:50:52 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_quantity_wd_units(graph_16_c1."node2") "units", count(graph_16_c1."node2") "value_count"
     FROM graph_16 AS graph_16_c1
     WHERE graph_16_c1."label"=?
     AND kgtk_quantity(graph_16_c1."node2")
     GROUP BY units
     ORDER BY count(graph_16_c1."node2") DESC
  PARAS: ['P2203']
---------------------------------------------
        0.58 real         0.46 user         0.11 sys

$kgtk query -i $WIKIDATA_PARTS/$quantity --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.5.P1109.data_types.tsv             --match  '(n1)-[r:P1109]->(v)'             --return 'distinct kgtk_quantity_wd_units(v) as units, count(v) as value_count'             --where 'kgtk_quantity(v)'             --order-by 'count(v) desc' 

[2020-10-20 11:50:53 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_quantity_wd_units(graph_16_c1."node2") "uni


[2020-10-20 11:50:59 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_quantity_wd_units(graph_16_c1."node2") "units", count(graph_16_c1."node2") "value_count"
     FROM graph_16 AS graph_16_c1
     WHERE graph_16_c1."label"=?
     AND kgtk_quantity(graph_16_c1."node2")
     GROUP BY units
     ORDER BY count(graph_16_c1."node2") DESC
  PARAS: ['P2160']
---------------------------------------------
        0.62 real         0.48 user         0.11 sys

$kgtk query -i $WIKIDATA_PARTS/$quantity --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.5.P2201.data_types.tsv             --match  '(n1)-[r:P2201]->(v)'             --return 'distinct kgtk_quantity_wd_units(v) as units, count(v) as value_count'             --where 'kgtk_quantity(v)'             --order-by 'count(v) desc' 

[2020-10-20 11:51:00 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_quantity_wd_units(graph_16_c1."node2") "uni


[2020-10-20 11:51:05 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_quantity_wd_units(graph_16_c1."node2") "units", count(graph_16_c1."node2") "value_count"
     FROM graph_16 AS graph_16_c1
     WHERE graph_16_c1."label"=?
     AND kgtk_quantity(graph_16_c1."node2")
     GROUP BY units
     ORDER BY count(graph_16_c1."node2") DESC
  PARAS: ['P2056']
---------------------------------------------
        0.58 real         0.46 user         0.10 sys

$kgtk query -i $WIKIDATA_PARTS/$quantity --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.5.P2231.data_types.tsv             --match  '(n1)-[r:P2231]->(v)'             --return 'distinct kgtk_quantity_wd_units(v) as units, count(v) as value_count'             --where 'kgtk_quantity(v)'             --order-by 'count(v) desc' 

[2020-10-20 11:51:06 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_quantity_wd_units(graph_16_c1."node2") "uni


[2020-10-20 11:51:12 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_quantity_wd_units(graph_16_c1."node2") "units", count(graph_16_c1."node2") "value_count"
     FROM graph_16 AS graph_16_c1
     WHERE graph_16_c1."label"=?
     AND kgtk_quantity(graph_16_c1."node2")
     GROUP BY units
     ORDER BY count(graph_16_c1."node2") DESC
  PARAS: ['P1088']
---------------------------------------------
        0.82 real         0.60 user         0.16 sys

$kgtk query -i $WIKIDATA_PARTS/$quantity --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.5.P2406.data_types.tsv             --match  '(n1)-[r:P2406]->(v)'             --return 'distinct kgtk_quantity_wd_units(v) as units, count(v) as value_count'             --where 'kgtk_quantity(v)'             --order-by 'count(v) desc' 

[2020-10-20 11:51:13 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_quantity_wd_units(graph_16_c1."node2") "uni


[2020-10-20 11:51:18 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_quantity_wd_units(graph_16_c1."node2") "units", count(graph_16_c1."node2") "value_count"
     FROM graph_16 AS graph_16_c1
     WHERE graph_16_c1."label"=?
     AND kgtk_quantity(graph_16_c1."node2")
     GROUP BY units
     ORDER BY count(graph_16_c1."node2") DESC
  PARAS: ['P5929']
---------------------------------------------
        0.58 real         0.45 user         0.11 sys

$kgtk query -i $WIKIDATA_PARTS/$quantity --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.5.P6076.data_types.tsv             --match  '(n1)-[r:P6076]->(v)'             --return 'distinct kgtk_quantity_wd_units(v) as units, count(v) as value_count'             --where 'kgtk_quantity(v)'             --order-by 'count(v) desc' 

[2020-10-20 11:51:19 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_quantity_wd_units(graph_16_c1."node2") "uni

## The file is loaded in the data frame and along with that I have also displayed the statistics for 'value_count'

In [44]:
for ele in df_quantity_data_type:
    display(ele)
    find_statistics(ele,"value_count")

Unnamed: 0,units,value_count
0,Q483261,146073
1,Q28924752,457
2,Q14623804,1


Statistics for : value_count
Distinct : 3
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 48843.666666666664
Max : 146073
Min : 1
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q25267,8975
1,Q42289,506
2,Q11579,10
3,,3


Statistics for : value_count
Distinct : 4
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 2373.5
Max : 8975
Min : 3
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q13147228,1095
1,Q844211,8
2,Q834105,7
3,Q21604951,3


Statistics for : value_count
Distinct : 4
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 278.25
Max : 1095
Min : 3
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q3241121,460
1,Q41803,426
2,Q1645498,18
3,Q2332346,4


Statistics for : value_count
Distinct : 4
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 227.0
Max : 460
Min : 4
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q42289,414
1,Q25267,334
2,Q11579,124


Statistics for : value_count
Distinct : 3
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 290.6666666666667
Max : 414
Min : 124
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q21077820,648
1,Q21006887,67


Statistics for : value_count
Distinct : 2
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 357.5
Max : 648
Min : 67
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q6859652,451
1,Q177974,57
2,Q21064807,43
3,Q5139563,31
4,Q44395,17
5,Q103510,4
6,Q185648,2


Statistics for : value_count
Distinct : 7
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 86.42857142857143
Max : 451
Min : 2
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q21091747,336
1,Q21006887,72
2,Q21077820,45
3,Q21061369,12
4,Q21604951,11
5,Q2332346,1


Statistics for : value_count
Distinct : 6
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 79.5
Max : 336
Min : 1
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q21127659,313
1,Q834105,51
2,Q21061369,26
3,Q60606516,20
4,Q55726194,9
5,Q55435387,1
6,Q21091747,1
7,Q21064845,1
8,Q13147228,1


Statistics for : value_count
Distinct : 6
Distinct(%) :66.67%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 47.0
Max : 313
Min : 1
Zeros : 0
Zeros(%) : 66.67%


Unnamed: 0,units,value_count
0,Q42289,316
1,Q25267,87
2,Q11579,2


Statistics for : value_count
Distinct : 3
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 135.0
Max : 316
Min : 2
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q21077820,341
1,Q21006887,5
2,Q834105,1


Statistics for : value_count
Distinct : 3
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 115.66666666666667
Max : 341
Min : 1
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q83327,280


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 280.0
Max : 280
Min : 280
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q752197,214
1,Q13035094,13


Statistics for : value_count
Distinct : 2
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 113.5
Max : 214
Min : 13
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,,121


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 121.0
Max : 121
Min : 121
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q21077820,156
1,Q21006887,52
2,Q21091747,1


Statistics for : value_count
Distinct : 3
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 69.66666666666667
Max : 156
Min : 1
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q2080811,206
1,Q21604951,2


Statistics for : value_count
Distinct : 2
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 104.0
Max : 206
Min : 2
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q21091747,108
1,Q21006887,51
2,Q21077820,27
3,Q21061369,5
4,Q21604951,2
5,Q7140852,1
6,Q21075844,1


Statistics for : value_count
Distinct : 6
Distinct(%) :85.71%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 27.857142857142858
Max : 108
Min : 1
Zeros : 0
Zeros(%) : 85.71%


Unnamed: 0,units,value_count
0,Q25267,129
1,Q42289,59
2,Q11579,2


Statistics for : value_count
Distinct : 3
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 63.333333333333336
Max : 129
Min : 2
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q21077820,182
1,Q21006887,3


Statistics for : value_count
Distinct : 2
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 92.5
Max : 182
Min : 3
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count


Statistics for : value_count
Distinct : 0
Distinct(%) :nan%
Missing : 0
Missing(%) : nan%
Infinite : 0
Infinite(%) : nan%
Mean : nan
Max : nan
Min : nan
Zeros : 0
Zeros(%) : nan%


  """
  if __name__ == '__main__':
  del sys.path[0]


Unnamed: 0,units,value_count
0,Q2080811,174


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 174.0
Max : 174
Min : 174
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,,26


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 26.0
Max : 26
Min : 26
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q26158194,55
1,Q751310,13


Statistics for : value_count
Distinct : 2
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 34.0
Max : 55
Min : 13
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q25267,59


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 59.0
Max : 59
Min : 59
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count


Statistics for : value_count
Distinct : 0
Distinct(%) :nan%
Missing : 0
Missing(%) : nan%
Infinite : 0
Infinite(%) : nan%
Mean : nan
Max : nan
Min : nan
Zeros : 0
Zeros(%) : nan%


  """
  if __name__ == '__main__':
  del sys.path[0]


Unnamed: 0,units,value_count


Statistics for : value_count
Distinct : 0
Distinct(%) :nan%
Missing : 0
Missing(%) : nan%
Infinite : 0
Infinite(%) : nan%
Mean : nan
Max : nan
Min : nan
Zeros : 0
Zeros(%) : nan%


  """
  if __name__ == '__main__':
  del sys.path[0]


Unnamed: 0,units,value_count


Statistics for : value_count
Distinct : 0
Distinct(%) :nan%
Missing : 0
Missing(%) : nan%
Infinite : 0
Infinite(%) : nan%
Mean : nan
Max : nan
Min : nan
Zeros : 0
Zeros(%) : nan%


  """
  if __name__ == '__main__':
  del sys.path[0]


Unnamed: 0,units,value_count


Statistics for : value_count
Distinct : 0
Distinct(%) :nan%
Missing : 0
Missing(%) : nan%
Infinite : 0
Infinite(%) : nan%
Mean : nan
Max : nan
Min : nan
Zeros : 0
Zeros(%) : nan%


  """
  if __name__ == '__main__':
  del sys.path[0]


Unnamed: 0,units,value_count
0,Q182429,51


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 51.0
Max : 51
Min : 51
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q28719934,51


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 51.0
Max : 51
Min : 51
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q28719934,51


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 51.0
Max : 51
Min : 51
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q40603,46
1,Q28739766,4


Statistics for : value_count
Distinct : 2
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 25.0
Max : 46
Min : 4
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q25235,13
1,Q573,10
2,Q7727,8
3,Q1092296,7
4,Q11574,4
5,Q23387,1


Statistics for : value_count
Distinct : 6
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 7.166666666666667
Max : 13
Min : 1
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q20966455,38


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 38.0
Max : 38
Min : 38
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q26156132,31
1,Q26156113,6


Statistics for : value_count
Distinct : 2
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 18.5
Max : 31
Min : 6
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,,12


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 12.0
Max : 12
Min : 12
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q42289,15
1,Q25267,11


Statistics for : value_count
Distinct : 2
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 13.0
Max : 15
Min : 11
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q752197,18
1,Q6408112,1
2,Q13035094,1


Statistics for : value_count
Distinct : 2
Distinct(%) :66.67%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 6.666666666666667
Max : 18
Min : 1
Zeros : 0
Zeros(%) : 66.67%


Unnamed: 0,units,value_count
0,Q752197,16
1,Q13035094,1


Statistics for : value_count
Distinct : 2
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 8.5
Max : 16
Min : 1
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,,14


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 14.0
Max : 14
Min : 14
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q20966455,8
1,Q3085309,4


Statistics for : value_count
Distinct : 2
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 6.0
Max : 8
Min : 4
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q182429,12


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 12.0
Max : 12
Min : 12
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q28390,12


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 12.0
Max : 12
Min : 12
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q1463969,11


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 11.0
Max : 11
Min : 11
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q752197,11


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 11.0
Max : 11
Min : 11
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count


Statistics for : value_count
Distinct : 0
Distinct(%) :nan%
Missing : 0
Missing(%) : nan%
Infinite : 0
Infinite(%) : nan%
Mean : nan
Max : nan
Min : nan
Zeros : 0
Zeros(%) : nan%


  """
  if __name__ == '__main__':
  del sys.path[0]


Unnamed: 0,units,value_count


Statistics for : value_count
Distinct : 0
Distinct(%) :nan%
Missing : 0
Missing(%) : nan%
Infinite : 0
Infinite(%) : nan%
Mean : nan
Max : nan
Min : nan
Zeros : 0
Zeros(%) : nan%


  """
  if __name__ == '__main__':
  del sys.path[0]


Unnamed: 0,units,value_count
0,Q26162546,4
1,Q26162545,2
2,Q26162530,2


Statistics for : value_count
Distinct : 2
Distinct(%) :66.67%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 2.6666666666666665
Max : 4
Min : 2
Zeros : 0
Zeros(%) : 66.67%


Unnamed: 0,units,value_count
0,Q21604951,8


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 8.0
Max : 8
Min : 8
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q4917,8


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 8.0
Max : 8
Min : 8
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count


Statistics for : value_count
Distinct : 0
Distinct(%) :nan%
Missing : 0
Missing(%) : nan%
Infinite : 0
Infinite(%) : nan%
Mean : nan
Max : nan
Min : nan
Zeros : 0
Zeros(%) : nan%


  """
  if __name__ == '__main__':
  del sys.path[0]


Unnamed: 0,units,value_count
0,Q21077820,5


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 5.0
Max : 5
Min : 5
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q55726194,3
1,Q21006887,2


Statistics for : value_count
Distinct : 2
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 2.5
Max : 3
Min : 2
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q191118,2


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 2.0
Max : 2
Min : 2
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q20966435,2


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 2.0
Max : 2
Min : 2
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q21091747,2


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 2.0
Max : 2
Min : 2
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q53448922,2


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 2.0
Max : 2
Min : 2
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count


Statistics for : value_count
Distinct : 0
Distinct(%) :nan%
Missing : 0
Missing(%) : nan%
Infinite : 0
Infinite(%) : nan%
Mean : nan
Max : nan
Min : nan
Zeros : 0
Zeros(%) : nan%


  """
  if __name__ == '__main__':
  del sys.path[0]


Unnamed: 0,units,value_count
0,Q21091747,1


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 1.0
Max : 1
Min : 1
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q55726194,1


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 1.0
Max : 1
Min : 1
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q11229,1


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 1.0
Max : 1
Min : 1
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,units,value_count
0,Q11229,1


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 1.0
Max : 1
Min : 1
Zeros : 0
Zeros(%) : 100.0%


## In this cell we are finding the distribution of different range of magnitudes.

In [45]:
df_quantity_data_type_magnitude = []
try:
    df = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv('property_summary_quantity')),delimiter='\t')
    cmd = "$kgtk query -i $WIKIDATA_PARTS/$quantity --graph-cache $STORE \
            -o $OUTPUT_FOLDER/__output \
            --match  '(n1)-[r:__property]->(v)' \
            --return 'distinct kgtk_quantity_number_int(v) as magnitude, count(v) as value_count' \
            --where 'kgtk_quantity(v)' \
            --order-by 'count(v) desc' "
    for index, ele in df.iterrows():
        output_file = subset_name + ".2.2.5." + ele["property"]+".data_types_magnitude"+".tsv"
        run_command(cmd, {"__property": ele["property"],"__output": output_file})
        temp = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),output_file),delimiter='\t')
        temp['magnitude'] = pd.cut(temp['magnitude'], bins=10)
        temp = temp['magnitude'].value_counts().rename_axis('magnitude').reset_index(name='counts')
        df_quantity_data_type_magnitude.append(temp)
        temp.to_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),output_file),sep='\t',index=False)
except Exception as e:
    print(e)

$kgtk query -i $WIKIDATA_PARTS/$quantity --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.5.P2067.data_types_magnitude.tsv             --match  '(n1)-[r:P2067]->(v)'             --return 'distinct kgtk_quantity_number_int(v) as magnitude, count(v) as value_count'             --where 'kgtk_quantity(v)'             --order-by 'count(v) desc' 

[2020-10-20 11:53:55 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_quantity_number_int(graph_16_c1."node2") "magnitude", count(graph_16_c1."node2") "value_count"
     FROM graph_16 AS graph_16_c1
     WHERE graph_16_c1."label"=?
     AND kgtk_quantity(graph_16_c1."node2")
     GROUP BY magnitude
     ORDER BY count(graph_16_c1."node2") DESC
  PARAS: ['P2067']
---------------------------------------------
        2.65 real         2.46 user         0.14 sys

$kgtk query -i $WIKIDATA_PARTS/$quantity --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.5.P2101.data_types_magnitude.t


[2020-10-20 11:54:04 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_quantity_number_int(graph_16_c1."node2") "magnitude", count(graph_16_c1."node2") "value_count"
     FROM graph_16 AS graph_16_c1
     WHERE graph_16_c1."label"=?
     AND kgtk_quantity(graph_16_c1."node2")
     GROUP BY magnitude
     ORDER BY count(graph_16_c1."node2") DESC
  PARAS: ['P2129']
---------------------------------------------
        0.69 real         0.52 user         0.13 sys

$kgtk query -i $WIKIDATA_PARTS/$quantity --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.5.P2260.data_types_magnitude.tsv             --match  '(n1)-[r:P2260]->(v)'             --return 'distinct kgtk_quantity_number_int(v) as magnitude, count(v) as value_count'             --where 'kgtk_quantity(v)'             --order-by 'count(v) desc' 

[2020-10-20 11:54:05 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_quantity_number_i

## In the below cell, all the computed files are loaded in the data frame and also the statistics of the 'count' is printed.

In [46]:
for ele in df_quantity_data_type_magnitude:
    display(ele)
    find_statistics(ele,"counts")

Unnamed: 0,magnitude,counts
0,"(-743.511, 74553.1]",2130
1,"(298206.4, 372757.5]",5
2,"(223655.3, 298206.4]",3
3,"(149104.2, 223655.3]",3
4,"(74553.1, 149104.2]",2
5,"(670961.9, 745513.0]",1
6,"(521859.7, 596410.8]",1
7,"(447308.6, 521859.7]",1
8,"(596410.8, 670961.9]",0
9,"(372757.5, 447308.6]",0


Statistics for : counts
Distinct : 6
Distinct(%) :60.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 214.6
Max : 2130
Min : 0
Zeros : 2
Zeros(%) : 60.0%


Unnamed: 0,magnitude,counts
0,"(-385.029, 1031.9]",660
1,"(1031.9, 2434.8]",79
2,"(2434.8, 3837.7]",16
3,"(3837.7, 5240.6]",3
4,"(12255.1, 13658.0]",1
5,"(10852.2, 12255.1]",0
6,"(9449.3, 10852.2]",0
7,"(8046.4, 9449.3]",0
8,"(6643.5, 8046.4]",0
9,"(5240.6, 6643.5]",0


Statistics for : counts
Distinct : 6
Distinct(%) :60.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 75.9
Max : 660
Min : 0
Zeros : 5
Zeros(%) : 60.0%


Unnamed: 0,magnitude,counts
0,"(-1.9, 190.0]",15
1,"(760.0, 950.0]",5
2,"(1520.0, 1710.0]",2
3,"(1710.0, 1900.0]",1
4,"(1140.0, 1330.0]",1
5,"(1330.0, 1520.0]",0
6,"(950.0, 1140.0]",0
7,"(570.0, 760.0]",0
8,"(380.0, 570.0]",0
9,"(190.0, 380.0]",0


Statistics for : counts
Distinct : 5
Distinct(%) :50.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 2.4
Max : 15
Min : 0
Zeros : 5
Zeros(%) : 50.0%


Unnamed: 0,magnitude,counts
0,"(-0.09, 9.0]",10
1,"(9.0, 18.0]",7
2,"(18.0, 27.0]",6
3,"(36.0, 45.0]",4
4,"(81.0, 90.0]",3
5,"(72.0, 81.0]",2
6,"(63.0, 72.0]",2
7,"(45.0, 54.0]",2
8,"(27.0, 36.0]",2
9,"(54.0, 63.0]",1


Statistics for : counts
Distinct : 7
Distinct(%) :70.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 3.9
Max : 10
Min : 1
Zeros : 0
Zeros(%) : 70.0%


Unnamed: 0,magnitude,counts
0,"(-320.208, 407.8]",335
1,"(407.8, 1128.6]",177
2,"(1128.6, 1849.4]",24
3,"(1849.4, 2570.2]",11
4,"(2570.2, 3291.0]",9
5,"(4732.6, 5453.4]",3
6,"(3291.0, 4011.8]",3
7,"(6174.2, 6895.0]",2
8,"(4011.8, 4732.6]",2
9,"(5453.4, 6174.2]",1


Statistics for : counts
Distinct : 8
Distinct(%) :80.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 56.7
Max : 335
Min : 1
Zeros : 0
Zeros(%) : 80.0%


Unnamed: 0,magnitude,counts
0,"(-9.0, 900.0]",106
1,"(900.0, 1800.0]",13
2,"(1800.0, 2700.0]",6
3,"(2700.0, 3600.0]",4
4,"(5400.0, 6300.0]",3
5,"(6300.0, 7200.0]",2
6,"(3600.0, 4500.0]",2
7,"(8100.0, 9000.0]",1
8,"(7200.0, 8100.0]",1
9,"(4500.0, 5400.0]",1


Statistics for : counts
Distinct : 7
Distinct(%) :70.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 13.9
Max : 106
Min : 1
Zeros : 0
Zeros(%) : 70.0%


Unnamed: 0,magnitude,counts
0,"(-200.0, 20000.0]",110
1,"(180000.0, 200000.0]",1
2,"(160000.0, 180000.0]",0
3,"(140000.0, 160000.0]",0
4,"(120000.0, 140000.0]",0
5,"(100000.0, 120000.0]",0
6,"(80000.0, 100000.0]",0
7,"(60000.0, 80000.0]",0
8,"(40000.0, 60000.0]",0
9,"(20000.0, 40000.0]",0


Statistics for : counts
Distinct : 3
Distinct(%) :30.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 11.1
Max : 110
Min : 0
Zeros : 8
Zeros(%) : 30.0%


Unnamed: 0,magnitude,counts
0,"(-850.0, 85000.0]",292
1,"(765000.0, 850000.0]",1
2,"(680000.0, 765000.0]",0
3,"(595000.0, 680000.0]",0
4,"(510000.0, 595000.0]",0
5,"(425000.0, 510000.0]",0
6,"(340000.0, 425000.0]",0
7,"(255000.0, 340000.0]",0
8,"(170000.0, 255000.0]",0
9,"(85000.0, 170000.0]",0


Statistics for : counts
Distinct : 3
Distinct(%) :30.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 29.3
Max : 292
Min : 0
Zeros : 8
Zeros(%) : 30.0%


Unnamed: 0,magnitude,counts
0,"(-4.5, 450.0]",74
1,"(450.0, 900.0]",7
2,"(4050.0, 4500.0]",2
3,"(900.0, 1350.0]",2
4,"(1800.0, 2250.0]",1
5,"(1350.0, 1800.0]",1
6,"(3600.0, 4050.0]",0
7,"(3150.0, 3600.0]",0
8,"(2700.0, 3150.0]",0
9,"(2250.0, 2700.0]",0


Statistics for : counts
Distinct : 5
Distinct(%) :50.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 8.7
Max : 74
Min : 0
Zeros : 4
Zeros(%) : 50.0%


Unnamed: 0,magnitude,counts
0,"(85.5, 149.0]",48
1,"(149.0, 212.5]",40
2,"(22.0, 85.5]",40
3,"(-41.5, 22.0]",34
4,"(212.5, 276.0]",24
5,"(276.0, 339.5]",20
6,"(339.5, 403.0]",10
7,"(-105.635, -41.5]",9
8,"(403.0, 466.5]",8
9,"(466.5, 530.0]",3


Statistics for : counts
Distinct : 9
Distinct(%) :90.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 23.6
Max : 48
Min : 3
Zeros : 0
Zeros(%) : 90.0%


Unnamed: 0,magnitude,counts
0,"(-243.6, 24360.0]",217
1,"(219240.0, 243600.0]",1
2,"(97440.0, 121800.0]",1
3,"(73080.0, 97440.0]",1
4,"(48720.0, 73080.0]",1
5,"(194880.0, 219240.0]",0
6,"(170520.0, 194880.0]",0
7,"(146160.0, 170520.0]",0
8,"(121800.0, 146160.0]",0
9,"(24360.0, 48720.0]",0


Statistics for : counts
Distinct : 3
Distinct(%) :30.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 22.1
Max : 217
Min : 0
Zeros : 5
Zeros(%) : 30.0%


Unnamed: 0,magnitude,counts
0,"(13.8, 15.1]",2
1,"(9.9, 11.2]",2
2,"(5.987, 7.3]",2
3,"(17.7, 19.0]",1
4,"(12.5, 13.8]",1
5,"(11.2, 12.5]",1
6,"(8.6, 9.9]",1
7,"(7.3, 8.6]",1
8,"(16.4, 17.7]",0
9,"(15.1, 16.4]",0


Statistics for : counts
Distinct : 3
Distinct(%) :30.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 1.1
Max : 2
Min : 0
Zeros : 2
Zeros(%) : 30.0%


Unnamed: 0,magnitude,counts
0,"(-27698.0, 9178.0]",196
1,"(46054.0, 82930.0]",3
2,"(-101450.0, -64574.0]",2
3,"(-212078.0, -175202.0]",2
4,"(-248954.0, -212078.0]",2
5,"(9178.0, 46054.0]",1
6,"(-138326.0, -101450.0]",1
7,"(-175202.0, -138326.0]",1
8,"(-286198.76, -248954.0]",1
9,"(-64574.0, -27698.0]",0


Statistics for : counts
Distinct : 5
Distinct(%) :50.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 20.9
Max : 196
Min : 0
Zeros : 1
Zeros(%) : 50.0%


Unnamed: 0,magnitude,counts
0,"(11.8, 14.2]",3
1,"(4.6, 7.0]",3
2,"(-0.2, 2.2]",3
3,"(14.2, 16.6]",2
4,"(9.4, 11.8]",2
5,"(7.0, 9.4]",2
6,"(2.2, 4.6]",2
7,"(16.6, 19.0]",1
8,"(-5.024, -2.6]",1
9,"(-2.6, -0.2]",0


Statistics for : counts
Distinct : 4
Distinct(%) :40.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 1.9
Max : 3
Min : 0
Zeros : 1
Zeros(%) : 40.0%


Unnamed: 0,magnitude,counts
0,"(-54.0, 5400.0]",68
1,"(48600.0, 54000.0]",1
2,"(5400.0, 10800.0]",1
3,"(43200.0, 48600.0]",0
4,"(37800.0, 43200.0]",0
5,"(32400.0, 37800.0]",0
6,"(27000.0, 32400.0]",0
7,"(21600.0, 27000.0]",0
8,"(16200.0, 21600.0]",0
9,"(10800.0, 16200.0]",0


Statistics for : counts
Distinct : 3
Distinct(%) :30.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 7.0
Max : 68
Min : 0
Zeros : 7
Zeros(%) : 30.0%


Unnamed: 0,magnitude,counts
0,"(-0.055, 5.5]",6
1,"(5.5, 11.0]",5
2,"(11.0, 16.5]",3
3,"(49.5, 55.0]",1
4,"(27.5, 33.0]",1
5,"(16.5, 22.0]",1
6,"(44.0, 49.5]",0
7,"(38.5, 44.0]",0
8,"(33.0, 38.5]",0
9,"(22.0, 27.5]",0


Statistics for : counts
Distinct : 5
Distinct(%) :50.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 1.7
Max : 6
Min : 0
Zeros : 4
Zeros(%) : 50.0%


Unnamed: 0,magnitude,counts
0,"(-900.0, 90000.0]",104
1,"(90000.0, 180000.0]",3
2,"(810000.0, 900000.0]",1
3,"(450000.0, 540000.0]",1
4,"(720000.0, 810000.0]",0
5,"(630000.0, 720000.0]",0
6,"(540000.0, 630000.0]",0
7,"(360000.0, 450000.0]",0
8,"(270000.0, 360000.0]",0
9,"(180000.0, 270000.0]",0


Statistics for : counts
Distinct : 4
Distinct(%) :40.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 10.9
Max : 104
Min : 0
Zeros : 6
Zeros(%) : 40.0%


Unnamed: 0,magnitude,counts
0,"(293.0, 614.0]",50
1,"(-31.21, 293.0]",50
2,"(614.0, 935.0]",18
3,"(935.0, 1256.0]",8
4,"(2861.0, 3182.0]",4
5,"(1577.0, 1898.0]",3
6,"(1898.0, 2219.0]",2
7,"(1256.0, 1577.0]",2
8,"(2540.0, 2861.0]",1
9,"(2219.0, 2540.0]",1


Statistics for : counts
Distinct : 7
Distinct(%) :70.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 13.9
Max : 50
Min : 1
Zeros : 0
Zeros(%) : 70.0%


Unnamed: 0,magnitude,counts
0,"(-5.6, 560.0]",36
1,"(1680.0, 2240.0]",3
2,"(560.0, 1120.0]",2
3,"(5040.0, 5600.0]",1
4,"(2240.0, 2800.0]",1
5,"(1120.0, 1680.0]",1
6,"(4480.0, 5040.0]",0
7,"(3920.0, 4480.0]",0
8,"(3360.0, 3920.0]",0
9,"(2800.0, 3360.0]",0


Statistics for : counts
Distinct : 5
Distinct(%) :50.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 4.4
Max : 36
Min : 0
Zeros : 4
Zeros(%) : 50.0%


## In the below cell, distribution of dates for top K properties is being found and also finding the statistics for the 'counts' column

In [47]:
df_time_data_type_year = []
try:
    df = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv('property_summary_time')),delimiter='\t')
    cmd = "$kgtk query -i $WIKIDATA_PARTS/$time --graph-cache $STORE \
            -o $OUTPUT_FOLDER/__output \
            --match  '(n1)-[r:__property]->(v)' \
            --return 'distinct kgtk_date_year(v) as year, count(v) as value_count' \
            --where 'kgtk_date(v)' \
            --order-by 'count(v) desc' "
    for index, ele in df.iterrows():
        output_file = subset_name + ".2.2.2." + ele["property"]+".data_types_year"+".tsv"
        run_command(cmd, {"__property": ele["property"],"__output": output_file})
        temp = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),output_file),delimiter='\t')
        df_time_data_type_year.append(temp)
except Exception as e:
    print(e)

$kgtk query -i $WIKIDATA_PARTS/$time --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.2.P575.data_types_year.tsv             --match  '(n1)-[r:P575]->(v)'             --return 'distinct kgtk_date_year(v) as year, count(v) as value_count'             --where 'kgtk_date(v)'             --order-by 'count(v) desc' 

[2020-10-20 11:54:51 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_date_year(graph_17_c1."node2") "year", count(graph_17_c1."node2") "value_count"
     FROM graph_17 AS graph_17_c1
     WHERE graph_17_c1."label"=?
     AND kgtk_date(graph_17_c1."node2")
     GROUP BY year
     ORDER BY count(graph_17_c1."node2") DESC
  PARAS: ['P575']
---------------------------------------------
        0.59 real         0.46 user         0.11 sys

$kgtk query -i $WIKIDATA_PARTS/$time --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.2.P2669.data_types_year.tsv             --match  '(n1)-[r:P2669]->(v)'             --retu

## In the below cell the each of the above generated table is loaded in the dataframe and also printing the statistics for the 'value_count' column

In [48]:
for ele in df_time_data_type_year:
    display(ele.head())
    find_statistics(ele,'value_count')

Unnamed: 0,year,value_count
0,1856,3
1,1847,2
2,1993,1
3,1984,1
4,1970,1


Statistics for : value_count
Distinct : 3
Distinct(%) :17.65%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 1.1764705882352942
Max : 3
Min : 1
Zeros : 0
Zeros(%) : 17.65%


Unnamed: 0,year,value_count
0,2013,1


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 1.0
Max : 1
Min : 1
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,year,value_count
0,1933,1


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 1.0
Max : 1
Min : 1
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,year,value_count
0,1961,1


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 1.0
Max : 1
Min : 1
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,year,value_count
0,1970,1


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 1.0
Max : 1
Min : 1
Zeros : 0
Zeros(%) : 100.0%


## In the below cell the distribution time based on time zone is generated for top K properties.

In [49]:
df_time_data_type_zone = []
try:
    df = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv('property_summary_time')),delimiter='\t')
    cmd = "$kgtk query -i $WIKIDATA_PARTS/$time --graph-cache $STORE \
            -o $OUTPUT_FOLDER/__output \
            --match  '(n1)-[r:__property]->(v)' \
            --return 'distinct kgtk_date_zone_string(v) as zone, count(v) as value_count' \
            --where 'kgtk_date(v)' \
            --order-by 'count(v) desc'"
    for index, ele in df.iterrows():
        output_file = subset_name + ".2.2.2." + ele["property"]+".data_types_zone"+".tsv"
        run_command(cmd, {"__property": ele["property"],"__output": output_file})
        temp = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),output_file),delimiter='\t')
        df_time_data_type_zone.append(temp)
except Exception as e:
    print(e)

$kgtk query -i $WIKIDATA_PARTS/$time --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.2.P575.data_types_zone.tsv             --match  '(n1)-[r:P575]->(v)'             --return 'distinct kgtk_date_zone_string(v) as zone, count(v) as value_count'             --where 'kgtk_date(v)'             --order-by 'count(v) desc'

[2020-10-20 11:55:46 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_date_zone_string(graph_17_c1."node2") "zone", count(graph_17_c1."node2") "value_count"
     FROM graph_17 AS graph_17_c1
     WHERE graph_17_c1."label"=?
     AND kgtk_date(graph_17_c1."node2")
     GROUP BY zone
     ORDER BY count(graph_17_c1."node2") DESC
  PARAS: ['P575']
---------------------------------------------
        0.58 real         0.45 user         0.10 sys

$kgtk query -i $WIKIDATA_PARTS/$time --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.2.P2669.data_types_zone.tsv             --match  '(n1)-[r:P2669]->(v)'      

## The file is loaded in the data frame and along with that I have also displayed the statistics for value_count

In [50]:
for ele in df_time_data_type_zone:
    display(ele)
    find_statistics(ele,'value_count')

Unnamed: 0,zone,value_count
0,Z,20


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 20.0
Max : 20
Min : 20
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,zone,value_count
0,Z,1


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 1.0
Max : 1
Min : 1
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,zone,value_count
0,Z,1


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 1.0
Max : 1
Min : 1
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,zone,value_count
0,Z,1


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 1.0
Max : 1
Min : 1
Zeros : 0
Zeros(%) : 100.0%


Unnamed: 0,zone,value_count
0,Z,1


Statistics for : value_count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 1.0
Max : 1
Min : 1
Zeros : 0
Zeros(%) : 100.0%


In [51]:
df_string_data_type_lang = []
try:
    df = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv('property_summary_string')),delimiter='\t')
    cmd = "$kgtk query -i $WIKIDATA_PARTS/$string --graph-cache $STORE \
            -o $OUTPUT_FOLDER/__output \
            --match  '(n1)-[r:__property]->(v)' \
            --return 'distinct kgtk_lqstring_lang(v) as language, count(v) as value_count' \
            --where 'kgtk_lqstring(v)' \
            --order-by 'count(v) desc'"
    for index, ele in df.iterrows():
        output_file = subset_name + ".2.2.13." + ele["property"]+".data_types_lang"+".tsv"
        run_command(cmd, {"__property": ele["property"],"__output": output_file})
        temp = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),output_file),delimiter='\t')
        df_string_data_type_lang.append(temp)
except Exception as e:
    print(e)

$kgtk query -i $WIKIDATA_PARTS/$string --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.13.P274.data_types_lang.tsv             --match  '(n1)-[r:P274]->(v)'             --return 'distinct kgtk_lqstring_lang(v) as language, count(v) as value_count'             --where 'kgtk_lqstring(v)'             --order-by 'count(v) desc'

[2020-10-20 11:56:20 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_lqstring_lang(graph_18_c1."node2") "language", count(graph_18_c1."node2") "value_count"
     FROM graph_18 AS graph_18_c1
     WHERE graph_18_c1."label"=?
     AND kgtk_lqstring(graph_18_c1."node2")
     GROUP BY language
     ORDER BY count(graph_18_c1."node2") DESC
  PARAS: ['P274']
---------------------------------------------
        1.90 real         1.13 user         0.28 sys

$kgtk query -i $WIKIDATA_PARTS/$string --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.13.P233.data_types_lang.tsv             --match  '(n1)-[r


[2020-10-20 11:56:30 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_lqstring_lang(graph_18_c1."node2") "language", count(graph_18_c1."node2") "value_count"
     FROM graph_18 AS graph_18_c1
     WHERE graph_18_c1."label"=?
     AND kgtk_lqstring(graph_18_c1."node2")
     GROUP BY language
     ORDER BY count(graph_18_c1."node2") DESC
  PARAS: ['P591']
---------------------------------------------
        0.73 real         0.55 user         0.13 sys

$kgtk query -i $WIKIDATA_PARTS/$string --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.13.P874.data_types_lang.tsv             --match  '(n1)-[r:P874]->(v)'             --return 'distinct kgtk_lqstring_lang(v) as language, count(v) as value_count'             --where 'kgtk_lqstring(v)'             --order-by 'count(v) desc'

[2020-10-20 11:56:31 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_lqstring_lang(graph_18_c1."node2") "langua


[2020-10-20 11:56:37 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_lqstring_lang(graph_18_c1."node2") "language", count(graph_18_c1."node2") "value_count"
     FROM graph_18 AS graph_18_c1
     WHERE graph_18_c1."label"=?
     AND kgtk_lqstring(graph_18_c1."node2")
     GROUP BY language
     ORDER BY count(graph_18_c1."node2") DESC
  PARAS: ['P593']
---------------------------------------------
        0.64 real         0.48 user         0.12 sys

$kgtk query -i $WIKIDATA_PARTS/$string --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.13.P875.data_types_lang.tsv             --match  '(n1)-[r:P875]->(v)'             --return 'distinct kgtk_lqstring_lang(v) as language, count(v) as value_count'             --where 'kgtk_lqstring(v)'             --order-by 'count(v) desc'

[2020-10-20 11:56:38 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_lqstring_lang(graph_18_c1."node2") "langua


[2020-10-20 11:56:44 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_lqstring_lang(graph_18_c1."node2") "language", count(graph_18_c1."node2") "value_count"
     FROM graph_18 AS graph_18_c1
     WHERE graph_18_c1."label"=?
     AND kgtk_lqstring(graph_18_c1."node2")
     GROUP BY language
     ORDER BY count(graph_18_c1."node2") DESC
  PARAS: ['P246']
---------------------------------------------
        0.61 real         0.47 user         0.11 sys

$kgtk query -i $WIKIDATA_PARTS/$string --graph-cache $STORE             -o $OUTPUT_FOLDER/Q11173.2.2.13.P2572.data_types_lang.tsv             --match  '(n1)-[r:P2572]->(v)'             --return 'distinct kgtk_lqstring_lang(v) as language, count(v) as value_count'             --where 'kgtk_lqstring(v)'             --order-by 'count(v) desc'

[2020-10-20 11:56:44 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_lqstring_lang(graph_18_c1."node2") "lang

In [None]:
for ele in df_string_data_type_lang:
    display(ele)
    ele.describe()

## 2.3 Item properties
### Top K qnodes used in node2
### qnode, label, count

## Header

### qnode -- qnode corresponding to node2 
### count -- number of times qnode is present 
### label -- label of the qnode



In [54]:
cmd  = "$kgtk query  -i $WIKIDATA_PARTS/$wikibase_item -i $WIKIDATA_PARTS/$label --graph-cache $STORE \
-o $OUTPUT_FOLDER/$item_properties \
--match 'part: (n1)-[l]->(n2), label: (n2)-[:label]->(label)' \
--return 'distinct n2 as qnode, count(n2) as count, label ' \
--where '(label.kgtk_lqstring_lang_suffix = \"en\") AND (n2 != \"__subset_name\")' \
--order-by 'count(n2) desc'"
run_command(cmd, {"__subset_name":subset_name})

$kgtk query  -i $WIKIDATA_PARTS/$wikibase_item -i $WIKIDATA_PARTS/$label --graph-cache $STORE -o $OUTPUT_FOLDER/$item_properties --match 'part: (n1)-[l]->(n2), label: (n2)-[:label]->(label)' --return 'distinct n2 as qnode, count(n2) as count, label ' --where '(label.kgtk_lqstring_lang_suffix = "en") AND (n2 != "Q11173")' --order-by 'count(n2) desc'

[2020-10-20 11:57:27 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_20_c2."node1" "qnode", count(graph_20_c2."node1") "count", graph_20_c2."node2"
     FROM graph_19 AS graph_19_c1, graph_20 AS graph_20_c2
     WHERE graph_20_c2."label"=?
     AND graph_19_c1."node2"=graph_20_c2."node1"
     AND ((kgtk_lqstring_lang_suffix(graph_20_c2."node2") = ?) AND (graph_20_c2."node1" != ?))
     GROUP BY qnode
     ORDER BY count(graph_20_c2."node1") DESC
  PARAS: ['label', 'en', 'Q11173']
---------------------------------------------
        3.63 real         3.06 user         0.31 sys



## The file is loaded in the data frame and along with that I have also displayed the statistics for 'count' column

In [56]:
df_item = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv('item_properties')),delimiter='\t')
display(df_item.head())
find_statistics(df_item,"count")

Unnamed: 0,qnode,count,node2
0,Q90481889,49433,'CAS COVID-19 Anti-Viral Candidate Compounds'@en
1,Q623,6174,'carbon'@en
2,Q629,5434,'oxygen'@en
3,Q12140,5048,'medication'@en
4,Q627,2418,'nitrogen'@en


Statistics for : count
Distinct : 191
Distinct(%) :1.15%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 11.902906208718626
Max : 49433
Min : 1
Zeros : 0
Zeros(%) : 1.15%


## 2.4 Time properties
### min time, max time
### Chart: x axis is different times, y axis is count of nodes that have the value. Binning maybe required


In [51]:
!$kgtk query  -i $WIKIDATA_PARTS/$time  --graph-cache $STORE \
-o $OUTPUT_FOLDER/$time_properties \
--match 'part: (n1)-[l]->(time)' \
--return 'min(time) as min_time, max(time) as max_time'

[2020-10-13 13:01:52 query]: SQL Translation:
---------------------------------------------
  SELECT min(graph_5_c1."node2") "min_time", max(graph_5_c1."node2") "max_time"
     FROM graph_5 AS graph_5_c1
  PARAS: []
---------------------------------------------
min_time	max_time
^1669-00-00T00:00:00Z/9	^2013-05-00T00:00:00Z/10
        0.55 real         0.43 user         0.11 sys


## Count of top K distinct time 

## Header

### time -- time
### count --- number of times it is present



In [57]:
!$kgtk query  -i $WIKIDATA_PARTS/$time  --graph-cache $STORE \
-o $OUTPUT_FOLDER/$time_properties_count \
--match 'part: (n1)-[l]->(time)' \
--return 'distinct time as `time`, count(time) as count' \
--order-by 'count(time) desc'

[2020-10-20 11:58:21 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_17_c1."node2" "time", count(graph_17_c1."node2") "count"
     FROM graph_17 AS graph_17_c1
     GROUP BY time
     ORDER BY count(graph_17_c1."node2") DESC
  PARAS: []
---------------------------------------------
        0.59 real         0.44 user         0.11 sys


## The file is loaded in the data frame and along with that I have also displayed the statistics

In [59]:
df_time = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv('time_properties_count')),delimiter='\t')
display(df_time.head())
df_time.describe()
find_statistics(df_time,"count")

Unnamed: 0,time,count
0,^1856-00-00T00:00:00Z/9,2
1,^1847-00-00T00:00:00Z/9,2
2,^2013-05-00T00:00:00Z/10,1
3,^1993-00-00T00:00:00Z/9,1
4,^1987-00-00T00:00:00Z/9,1


Statistics for : count
Distinct : 2
Distinct(%) :8.7%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 1.0869565217391304
Max : 2
Min : 1
Zeros : 0
Zeros(%) : 8.7%


## 2.5 Quantity properties
### list of units and counts
### Chart: x axis is different magnitudes, y axis is count of nodes that have the value. Binning is required

## Header
### units -- units
### counts -- count corresponding to each unit


In [60]:
!$kgtk query  -i $WIKIDATA_PARTS/$quantity -i $WIKIDATA_PARTS/$label --graph-cache $STORE \
-o $OUTPUT_FOLDER/$quantity_properties \
--match 'part: (n1)-[l{label: llab}]->(n2), label: (llab)-[:label]->(label)' \
--return 'distinct label as units, count(llab) as counts'\
--where 'label.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(llab) desc'

[2020-10-20 11:59:27 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_20_c2."node2" "units", count(graph_16_c1."label") "counts"
     FROM graph_16 AS graph_16_c1, graph_20 AS graph_20_c2
     WHERE graph_16_c1."label"=graph_16_c1."label"
     AND graph_20_c2."label"=?
     AND graph_16_c1."label"=graph_20_c2."node1"
     AND (kgtk_lqstring_lang_suffix(graph_20_c2."node2") = ?)
     GROUP BY units
     ORDER BY count(graph_16_c1."label") DESC
  PARAS: ['label', 'en']
---------------------------------------------
        1.60 real         1.42 user         0.13 sys


## The file is loaded in the data frame and along with that I have also displayed the statistics for counts column

In [61]:
df_quantity = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv('quantity_properties')),delimiter='\t')
display(df_quantity.head())
find_statistics(df_quantity,"counts")

Unnamed: 0,units,counts
0,'mass'@en,146546
1,'melting point'@en,9494
2,'density'@en,1113
3,'defined daily dose'@en,908
4,'boiling point'@en,873


Statistics for : counts
Distinct : 42
Distinct(%) :67.74%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 2661.5967741935483
Max : 146546
Min : 1
Zeros : 0
Zeros(%) : 67.74%


## 2.6 String, Monolingual, url, External id, Math musical notation
### number of distinct values
### list of top M values

## Header
### value -- value
### number_of_statements -- number of statements corresponding to that value.



In [63]:
types = [
    ("math","distinct_values_math"),
    ("string","distinct_values_string"),
    ("external_id","distinct_values_external_id"),
    ("monolingualtext","distinct_values_monolingual_text"),
    ("musical_notation","distinct_values_musical_notation"),
]

cmd = "$kgtk query  -i $WIKIDATA_PARTS/$TYPE_FILE  --graph-cache $STORE \
-o $OUTPUT_FOLDER/$output_file \
--match 'part: (n1)-[l]->(n2)' \
--return 'distinct n2 as value, count(n2) as number_of_statements' \
--order-by 'count(n2) desc'"

for type, output_file in types:
    run_command(cmd, {"TYPE_FILE": type,"output_file":output_file})

$kgtk query  -i $WIKIDATA_PARTS/$math  --graph-cache $STORE -o $OUTPUT_FOLDER/$distinct_values_math --match 'part: (n1)-[l]->(n2)' --return 'distinct n2 as value, count(n2) as number_of_statements' --order-by 'count(n2) desc'

[2020-10-20 12:00:14 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_22_c1."node2" "value", count(graph_22_c1."node2") "number_of_statements"
     FROM graph_22 AS graph_22_c1
     GROUP BY value
     ORDER BY count(graph_22_c1."node2") DESC
  PARAS: []
---------------------------------------------
        0.57 real         0.44 user         0.11 sys

$kgtk query  -i $WIKIDATA_PARTS/$string  --graph-cache $STORE -o $OUTPUT_FOLDER/$distinct_values_string --match 'part: (n1)-[l]->(n2)' --return 'distinct n2 as value, count(n2) as number_of_statements' --order-by 'count(n2) desc'

[2020-10-20 12:00:15 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_18_c1."node2" "value", co

## I have loaded all the files in dataframe and also displayed the statistics for 'number_of_statemens' for each of the dataframe

In [64]:
types = [
    ("math","distinct_values_math"),
    ("string","distinct_values_string"),
    ("external_id","distinct_values_external_id"),
    ("monolingualtext","distinct_values_monolingual_text"),
    ("musical_notation","distinct_values_musical_notation"),
]
df_distinct_values = []
for type, output_file in types:
    try :
        temp = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv(output_file)),delimiter='\t')
    except Exception as e:
        continue
    df_distinct_values.append(temp)
    display(df_distinct_values[-1].head())
    find_statistics(temp,"number_of_statements")

Unnamed: 0,value,number_of_statements


Statistics for : number_of_statements
Distinct : 0
Distinct(%) :nan%
Missing : 0
Missing(%) : nan%
Infinite : 0
Infinite(%) : nan%
Mean : nan
Max : nan
Min : nan
Zeros : 0
Zeros(%) : nan%


  """
  if __name__ == '__main__':
  del sys.path[0]


Unnamed: 0,value,number_of_statements
0,,890
1,C₆₁H₁₀₆O₆,247
2,C₆₁H₁₀₄O₆,245
3,C₆₃H₁₁₀O₆,239
4,C₆₁H₁₀₂O₆,236


Statistics for : number_of_statements
Distinct : 162
Distinct(%) :0.04%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 1.5876895739493369
Max : 890
Min : 1
Zeros : 0
Zeros(%) : 0.04%


Unnamed: 0,value,number_of_statements
0,novalue,32
1,3077,24
2,1993,20
3,5J7L,19
4,5J5B,19


Statistics for : number_of_statements
Distinct : 22
Distinct(%) :0.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 1.0308421263930696
Max : 32
Min : 1
Zeros : 0
Zeros(%) : 0.0%


Unnamed: 0,value,number_of_statements
0,'zopiclone'@en,2
1,'tomoxetine'@en,2
2,'sulfamoxole'@en,2
3,'pembrolizumab'@en,2
4,'doxycycline'@en,2


Statistics for : number_of_statements
Distinct : 2
Distinct(%) :0.08%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 1.0061753808151503
Max : 2
Min : 1
Zeros : 0
Zeros(%) : 0.08%


Unnamed: 0,value,number_of_statements


Statistics for : number_of_statements
Distinct : 0
Distinct(%) :nan%
Missing : 0
Missing(%) : nan%
Infinite : 0
Infinite(%) : nan%
Mean : nan
Max : nan
Min : nan
Zeros : 0
Zeros(%) : nan%


  """
  if __name__ == '__main__':
  del sys.path[0]


## 2.7 Globe coordinate
### map with top M, put circles with radius proportional to the number of nodes that have the coordinate 

## Header
### Coordinate -- Coordinate
### Count -- Number of instances of the coordinate


In [67]:
!$kgtk query  -i $WIKIDATA_PARTS/$globe_coordinate -i $WIKIDATA_PARTS/$label --graph-cache $STORE \
-o $OUTPUT_FOLDER/$globe_coordinate_top_m \
--match 'part: (n1)-[l]->(n2)' \
--return 'distinct n2 as Coordinate, count(n2) as count'\
--order-by 'count(n2) desc'

[2020-10-20 12:03:08 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_26_c1."node2" "Coordinate", count(graph_26_c1."node2") "count"
     FROM graph_26 AS graph_26_c1
     GROUP BY Coordinate
     ORDER BY count(graph_26_c1."node2") DESC
  PARAS: []
---------------------------------------------
        0.59 real         0.45 user         0.11 sys


## I have loaded all the files in dataframe and also displayed the statistics for 'count' for each of the dataframe

In [69]:
df_coordinate = pd.read_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv('globe_coordinate_top_m')),delimiter='\t')
display(df_coordinate.head())
find_statistics(df_coordinate,"count")


Unnamed: 0,Coordinate,count
0,@37.25/27,1


Statistics for : count
Distinct : 1
Distinct(%) :100.0%
Missing : 0
Missing(%) : 0.0%
Infinite : 0
Infinite(%) : 0.0%
Mean : 1.0
Max : 1
Min : 1
Zeros : 0
Zeros(%) : 100.0%


## 2.8 geo-shape
### random sample of M nodes that have the property

## Header
### node1 -- node1
### label -- label
### node2 -- node2


In [80]:
try:
    df = pd.read_csv(os.path.join(os.getenv('WIKIDATA_PARTS'),os.getenv('geo_shape')),delimiter='\t',index_col=False)
    try:
        num_rows = min(int(K[1:-1]),len(df))
    except Exception as e:
        num_rows = min(int(K),len(df))
    df.sample(n=num_rows)
    display(df)
    df.to_csv(os.path.join(os.getenv('OUTPUT_FOLDER'),os.getenv('geo_shape_top_m')),sep='\t',index=False)
except Exception as e:
    print(e)

Unnamed: 0,id,node1,label,node2
