Analyse the hierarchy of the retrieved OpenAlex subjects

In [4]:
import json
from collections import Counter

In [3]:
subjects_file = '../data/openalex/subjects.json'
with open(subjects_file) as f:
  subjects = json.load(f)
len(subjects)

2157

In [6]:
level_cnt = Counter(v['level'] for v in subjects.values())
level_cnt  # number of subjects per level

Counter({0: 19, 1: 25, 2: 1999, 3: 108, 4: 6})

For each level, how common is each path to the root? By "path" I mean the set of levels present in a subject's list of ancestors.

In [15]:
def get_path(ancestors):
  """ Return the sorted levels present in the ancestor list, as a string (e.g. '[0, 1]'). """
  levels = {a['level'] for a in ancestors}  # deduplicate levels
  return str(sorted(levels))
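
To make the notion of a path concrete, here is a small sketch that applies the same logic to made-up ancestor lists. The entries and their names are hypothetical and only carry the 'level' key that the function reads; they are not taken from subjects.json.

```python
def get_path(ancestors):
  """ Same logic as the notebook cell above: sorted distinct levels, as a string. """
  levels = {a['level'] for a in ancestors}
  return str(sorted(levels))

# Hypothetical ancestor lists (names invented for illustration only)
toy_ancestors = [
  {'name': 'hypothetical root field', 'level': 0},
  {'name': 'hypothetical subfield', 'level': 1},
  {'name': 'another hypothetical subfield', 'level': 1},
]
root_only = [{'name': 'hypothetical root field', 'level': 0}]

toy_path = get_path(toy_ancestors)   # duplicate levels collapse into one entry
skipped_path = get_path(root_only)   # a subject whose ancestors are all at level 0
print(toy_path, skipped_path)
```

Because the levels are deduplicated, the path records only which levels occur among the ancestors, not how many ancestors sit at each level.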

In [16]:
def count_paths(level):
  """ Count paths for a given level. """
  paths = Counter()
  for subject in subjects.values():
    if subject['level'] == level:
      path = get_path(subject['ancestors'])
      paths[path] += 1
  return paths

In [17]:
lv2 = count_paths(2)
lv2

Counter({'[0, 1]': 1997, '[0]': 2})

In [18]:
count_paths(3), count_paths(4)

(Counter({'[0, 1]': 11, '[0, 1, 2]': 97}), Counter({'[0, 1, 2, 3]': 6}))

Most subjects (all but 13: 2 at level 2 and 11 at level 3) have complete paths up to the root. Which subjects of levels 2 and 3 don't?

In [25]:
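def incomplete_paths(level):
  """ Map id -> name for subjects of a level whose path skips a level. """
  incomplete = {}
  for subject_id, subject in subjects.items():  # avoid shadowing the builtin id
    if subject['level'] == level:
      path = get_path(subject['ancestors'])
      # a complete path contains every level from 0 to level - 1
      if path != str(list(range(level))):
        incomplete[subject_id] = subject['name']
  return incomplete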

In [26]:
incomplete_paths(2)

{'https://openalex.org/C21200559': 'Sensitivity (control systems)',
 'https://openalex.org/C62354387': 'Boundary (topology)'}

In [27]:
incomplete_paths(3)

{'https://openalex.org/C71405471': 'Quality management',
 'https://openalex.org/C106436119': 'Quality assurance',
 'https://openalex.org/C106906290': "Cronbach's alpha",
 'https://openalex.org/C49453240': 'Construct validity',
 'https://openalex.org/C40722632': 'Confirmatory factor analysis',
 'https://openalex.org/C24756922': 'Data quality',
 'https://openalex.org/C138897024': 'Total quality management',
 'https://openalex.org/C165957694': 'Exploratory factor analysis',
 'https://openalex.org/C2776950860': 'Originality',
 'https://openalex.org/C71760877': 'Cultural identity',
 'https://openalex.org/C98447023': 'Social identity theory'}
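
The figure of 13 incomplete paths can be cross-checked from the path counters alone. The sketch below hard-codes the counters printed above for levels 2 to 4. Levels 0 and 1 are left out because their path counts are not shown in this notebook. The helper name n_incomplete is hypothetical.

```python
from collections import Counter

# Path counts copied from the count_paths outputs above
path_counts = {
  2: Counter({'[0, 1]': 1997, '[0]': 2}),
  3: Counter({'[0, 1]': 11, '[0, 1, 2]': 97}),
  4: Counter({'[0, 1, 2, 3]': 6}),
}

def n_incomplete(level, counts):
  """ Count subjects whose path is not the full list of levels below `level`. """
  complete = str(list(range(level)))
  return sum(n for path, n in counts.items() if path != complete)

incomplete_per_level = {
  level: n_incomplete(level, counts) for level, counts in path_counts.items()
}
total_incomplete = sum(incomplete_per_level.values())
print(incomplete_per_level, total_incomplete)
```

The per-level totals should match the sizes of the dictionaries returned by incomplete_paths(2) and incomplete_paths(3).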