# Analyze ORKG taxonomy

This notebook analyzes the ORKG taxonomy and checks it for duplicate labels and for labels that appear under more than one path.
The taxonomy is loaded from `references/orkg_taxonomy.csv` in the project directory and originates from the [ORKG research fields classifier repository](https://gitlab.com/TIBHannover/orkg/nlp/experiments/orkg-research-fields-classifier/-/blob/master/data/raw/mappings/taxonomy.csv).

In [1]:
import os
import pandas as pd

base_dir = "praktikum-ise-2023-patrick-zierahn"
taxonomy_df = pd.read_csv(os.path.join(base_dir, "references", "orkg_taxonomy.csv"))
taxonomy_df

| | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 0 | Arts and Humanities | American Studies | American Film Studies | | |
| 1 | Arts and Humanities | American Studies | American Material Culture | | |
| 2 | Arts and Humanities | American Studies | American Popular Culture | | |
| 3 | Arts and Humanities | American Studies | Ethnic Studies | | |
| 4 | Arts and Humanities | Classics | Ancient History (Greek and Roman through Late ... | | |
| ... | ... | ... | ... | ... | ... |
| 606 | Social and Behavioral Sciences | Sociology | Social Psychology and Interaction | | |
| 607 | Social and Behavioral Sciences | Sociology | Sociology of Culture | | |
| 608 | Social and Behavioral Sciences | Sociology | Theory, Knowledge and Science | | |
| 609 | Social and Behavioral Sciences | Sociology | Work, Economy and Organizations | | |


## Check for duplicate labels

The taxonomy contains many **ambiguous labels**, i.e. labels that are used for more than one research field.
For example, the label "American Studies" appears under "Arts and Humanities" and also under "Social and Behavioral Sciences, Social and Cultural Anthropology and Ethnology".

In [2]:
labels = set()
label_path = {}
significant_labels = set()

for _, row in taxonomy_df.iterrows():
    ancestors = []
    # Drop empty (NaN) levels so that only the actual labels of the path remain
    cleanedList = [label for label in row if not pd.isna(label)]

    # The last label of a row is its most specific research field
    significant_label = cleanedList[-1]
    if significant_label not in significant_labels:
        significant_labels.add(significant_label)
    else:
        print("Duplicate label:", significant_label)

    for label in cleanedList:
        labels.add(label)
        ancestors.append(label)

        # Compare with existing label paths
        if label in label_path and label_path[label] != ancestors:
            print("Different paths for label:", label)
            print("Existing path:", label_path[label])
            print("New path:", ancestors)
            print("-----------------------------------")

        # Store a copy: `ancestors` keeps growing inside this loop, so storing
        # the list itself would make every entry point to the same full path
        label_path[label] = ancestors.copy()

print("Number of labels:", len(labels))
#print_json("label_path", label_path)
print("Number of significant labels:", len(significant_labels))

Different paths for label: Musicology
Existing path: ['Arts and Humanities', 'Musicology']
New path: ['Arts and Humanities', 'Musicology', 'Musicology']
-----------------------------------
Different paths for label: Musicology
Existing path: ['Arts and Humanities', 'Musicology', 'Musicology']
New path: ['Arts and Humanities', 'Musicology']
-----------------------------------
Duplicate label: Rhetoric and Composition
Different paths for label: Rhetoric and Composition
Existing path: ['Arts and Humanities', 'English Language and Literature', 'Rhetoric and Composition']
New path: ['Arts and Humanities', 'Rhetoric and Composition']
-----------------------------------
Different paths for label: Computer Engineering
Existing path: ['Engineering', 'Computer Engineering']
New path: ['Engineering', 'Electrical and Computer Engineering', 'Computer Engineering']
-----------------------------------
Duplicate label: Energy Process Engineering
Different paths for label: Energy Process Engineering
Ex
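
The loop above keeps only the most recently seen path per label, which is presumably why "Musicology" is reported twice while its stored path flips between two variants. A minimal sketch of an alternative that collects every distinct path per label and reports each ambiguous label once is shown below. The helper names `is_missing` and `collect_paths` and the toy rows are hypothetical; the rows are assumed to be lists of labels ordered from root to leaf, with empty levels given as `None` or NaN.

```python
from collections import defaultdict
import math


def is_missing(value):
    """Return True for None or float NaN (how pandas marks empty cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def collect_paths(rows):
    """Map each label to the set of distinct root-to-label paths it appears on."""
    paths = defaultdict(set)
    for row in rows:
        ancestors = []
        for label in row:
            if is_missing(label):
                continue
            ancestors.append(label)
            # Tuples are immutable and hashable, so no explicit copy is needed.
            paths[label].add(tuple(ancestors))
    return paths


# Toy rows modelled on the output above (not the full taxonomy).
rows = [
    ["Arts and Humanities", "Musicology", None],
    ["Arts and Humanities", "Musicology", "Musicology"],
    ["Engineering", "Computer Engineering", None],
    ["Engineering", "Electrical and Computer Engineering", "Computer Engineering"],
]

paths = collect_paths(rows)
# A label is ambiguous if it was seen on more than one distinct path.
ambiguous = {label: sorted(p) for label, p in paths.items() if len(p) > 1}
for label, variants in ambiguous.items():
    print(label, "->", variants)
```

With the notebook's data, something like `collect_paths(taxonomy_df.values.tolist())` should work in the same way, since each row is then a plain list with NaN for the empty levels.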

In [3]:
from pyvis.network import Network 

net = Network(height="750px", width="100%", bgcolor="#222222", font_color="white", notebook=True, cdn_resources='remote')

# Build network from taxonomy
for _, row in taxonomy_df.iterrows():
    cleanedList = [label for label in row if not pd.isna(label)]

    parent = None
    for label in cleanedList:
        # net.nodes is a list of dicts, so check membership against the node ids
        if label not in net.get_nodes():
            net.add_node(label, label, title=label, color="#00ff00")

        # Connect each label to the label one level above it
        if parent is not None:
            net.add_edge(parent, label)
        parent = label

net.show("orkg_taxonomy.html")

orkg_taxonomy.html


In [4]:
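import ast

import numpy as np

df = pd.read_csv('data/orkg_with_abstracts.csv')
# The list columns are stored as strings; parse them with ast.literal_eval
# instead of eval (assumes each cell holds a Python list literal)
df["doi"] = df["doi"].apply(ast.literal_eval).apply(np.array)
df["subfields"] = df["subfields"].apply(ast.literal_eval).apply(np.array)
df = df.fillna('')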

In [5]:
# Collect every distinct value from the "subfields" arrays in df
subfields = set()

for row_subfields in df["subfields"]:
    subfields.update(row_subfields)

print("Number of subfields:", len(subfields))

Number of subfields: 692


In [6]:
# Print both set differences, which together form the symmetric difference
print("Subfields not in labels:", subfields - labels)
print("Labels not in subfields:", labels - subfields)

Subfields not in labels: {'Sociocentric networks', 'Computer-Aided Design of Materials and Simulation of Materials Behaviour from  Atomic to Microscopic Scale', 'Quantitative Methods)', 'Women’s History', 'Ego-centric networks', 'Production Systems, Operations Management, Quality Management and Factory  Planning'}
Labels not in subfields: {'Women‘s History', 'Computer-Aided Design of Materials and Simulation of Materials Behaviour from Atomic to Microscopic Scale', 'Quantitative Methods', 'Production Systems, Operations Management, Quality Management and Factory Planning'}
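
Most of the mismatches above appear to be formatting differences rather than genuinely different fields: doubled spaces, two different apostrophe characters (’ and ‘), and a stray closing parenthesis. A hedged sketch of normalizing the strings before comparing them is shown below. The function `normalize_label` is hypothetical, and its rules are assumptions derived only from the strings printed above, so they may need adjusting for the full data.

```python
import re


def normalize_label(label):
    """Normalize a label for comparison (rules are assumptions, see the text above)."""
    # Map typographic single quotes (U+2018, U+2019) to a plain apostrophe.
    label = re.sub("[\u2018\u2019]", "'", label)
    # Collapse runs of whitespace into a single space and trim both ends.
    label = re.sub(r"\s+", " ", label).strip()
    # Drop a trailing ")" that has no matching "(".
    if label.endswith(")") and label.count("(") < label.count(")"):
        label = label[:-1].rstrip()
    return label


# Sample values copied from the output above (not the full sets).
subfields_sample = {
    "Women\u2019s History",
    "Quantitative Methods)",
    "Production Systems, Operations Management, Quality Management and Factory  Planning",
    "Sociocentric networks",
}
labels_sample = {
    "Women\u2018s History",
    "Quantitative Methods",
    "Production Systems, Operations Management, Quality Management and Factory Planning",
}

normalized_subfields = {normalize_label(s) for s in subfields_sample}
normalized_labels = {normalize_label(s) for s in labels_sample}

print("Subfields not in labels:", normalized_subfields - normalized_labels)
print("Labels not in subfields:", normalized_labels - normalized_subfields)
```

Applied to the notebook's `subfields` and `labels` sets, the same comprehension would presumably leave only entries such as "Sociocentric networks" and "Ego-centric networks", which have no similar-looking counterpart in the taxonomy output above.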
