Analyse the subjects extracted by `get_subjects.py`.

We tried several values for the lower limit on `works_count`, to strike a balance between picking upper-level subjects, which are more general, and picking subjects that are popular. The difficulty is that popularity varies widely between fields. For example, medicine has more than 200 third-level subjects with more than 25k works each, whereas environmental science has only 31 such subjects across all levels. We therefore started with a large limit, iterated over all levels and all fields, and then lowered the limit before iterating again. The iterations start at 25k works and go down to 50 subjects.
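
The iteration described above can be sketched roughly as follows. This is a minimal illustration, not the code in `get_subjects.py`: the function name `select_subjects`, the flat `field` key, the `per_field_target` cap, the intermediate limit of 5000 and the toy records are all assumptions made for the example.

```python
def select_subjects(subjects, limits, per_field_target):
    """Pick subjects per field, relaxing the works_count limit pass by pass.

    subjects: dict of id -> record with 'name', 'level', 'works_count' and
              'field' keys (a simplified, hypothetical schema).
    limits: works_count thresholds in decreasing order.
    per_field_target: hypothetical cap on the subjects kept per field.
    """
    selected = {}  # field name -> list of chosen subject names
    for limit in limits:
        # Within each pass, visit upper (more general) levels first.
        for subject in sorted(subjects.values(), key=lambda s: s['level']):
            chosen = selected.setdefault(subject['field'], [])
            if len(chosen) >= per_field_target:
                continue  # this field already has enough subjects
            if subject['works_count'] >= limit and subject['name'] not in chosen:
                chosen.append(subject['name'])
    return selected


# Toy records with made-up counts, for illustration only.
toy = {
    'a': {'name': 'med A', 'level': 1, 'works_count': 30000, 'field': 'Medicine'},
    'b': {'name': 'med B', 'level': 2, 'works_count': 26000, 'field': 'Medicine'},
    'c': {'name': 'env A', 'level': 1, 'works_count': 9000, 'field': 'Environmental science'},
    'd': {'name': 'env B', 'level': 2, 'works_count': 400, 'field': 'Environmental science'},
}
result = select_subjects(toy, limits=[25000, 5000, 50], per_field_target=2)
print(result)
```

In this toy run the popular field fills its quota in the first pass, while the less popular field only picks up subjects once the limit has been lowered, which is the behaviour the decreasing limit is meant to achieve.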

In [148]:
import json

In [149]:
with open('../data/openalex/subjects.json') as f:  # close the file after loading
  subjects = json.load(f)

In [150]:
len(subjects)  # no. of subjects

2161

In [151]:
level_counts = {}  # no. of subjects by level
for subject in subjects.values():
  level = subject['level']
  level_counts[level] = level_counts.get(level, 0) + 1
level_counts

{0: 19, 1: 25, 2: 2003, 3: 108, 4: 6}

In [152]:
field_counts = {}  # no. of subjects per field (excluding the fields themselves)
for subject in subjects.values():
  if subject['level'] == 0:
    field_counts[subject['name']] = 0
for subject in subjects.values():
  if subject['level'] != 0:
    for ancestor in subject['ancestors']:
      if ancestor['display_name'] in field_counts:
        field_counts[ancestor['display_name']] += 1
field_counts

{'Medicine': 326,
 'Chemistry': 268,
 'Biology': 492,
 'Computer science': 254,
 'Materials science': 214,
 'Engineering': 299,
 'Psychology': 260,
 'Physics': 344,
 'Political science': 327,
 'Mathematics': 241,
 'Business': 134,
 'Sociology': 153,
 'Geography': 244,
 'Art': 117,
 'Environmental science': 103,
 'Economics': 248,
 'Geology': 247,
 'History': 170,
 'Philosophy': 219}

In [153]:
works_cnt = [s['works_count'] for s in subjects.values()]
sum(works_cnt)/len(works_cnt)  # avg. number of works a subject is tagged on

197163.68671911152

In [154]:
sum([cnt == 0 for cnt in works_cnt])  # no. of subjects that don't have works

0

In [155]:
for subject in subjects.values():
  if subject['works_count'] == 0:
    print(subject)