## MAG matching

This notebook is used to find matching entries between the Microsoft Academic Graph (MAG) and the CORE dataset.

In [9]:
from glob import glob
from json import loads, dump, load
from os.path import basename, exists, isfile, isdir, sep
from os import makedirs, listdir, getcwd
from datetime import datetime
input_path = "core_sets/english_language/2021_04_07_17_27_13/batches"

In [2]:
from pyspark import SparkContext
sc = SparkContext(master = 'local[16]')

In [3]:
sc

In [4]:
if isdir(input_path):
    input_filepaths = sorted(glob(input_path + sep + "*"))

Entries can be matched either by their title or by their DOI. Additionally, the publication year and the author list are loaded so that possible matches can be checked against this information.
A Levenshtein comparison function is defined, which is used to compare the titles of documents.
It returns true if the input strings are very similar.
A small edit distance is acceptable, and a title that is a long substring of the other is also considered sufficiently similar.

In [5]:
def id_to_auth_doi(line):
    return (line['coreId'], line['title'], line['authors'], line['doi'], line['year'])

In [1]:
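from Levenshtein import distance as levenshtein_distance
def levenshtein_compare(t1, t2):
    if not t1 or not t2:
        return False
    gap = abs(len(t1) - len(t2))
    # Accept distances below the length gap plus 10% of the shorter title ...
    close = levenshtein_distance(t1.lower(), t2.lower()) < gap + 0.1 * min(len(t1), len(t2))
    # ... unless a very short title is paired with a much longer one.
    mismatched = (len(t1) < 20 or len(t2) < 20) and gap > 30
    return close and not mismatched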
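
To make the thresholds concrete, the sketch below re-implements the same rule with a pure-Python edit distance. This is an assumption made here so that the example runs without the `Levenshtein` package. The names `edit_distance` and `titles_similar` and the sample titles are hypothetical and not part of the notebook.

```python
def edit_distance(a, b):
    """Plain dynamic-programming Levenshtein distance (stand-in for the C library)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def titles_similar(t1, t2):
    """Same rule as levenshtein_compare, written out step by step."""
    if not t1 or not t2:
        return False
    gap = abs(len(t1) - len(t2))
    budget = gap + 0.1 * min(len(t1), len(t2))
    short_vs_long = (len(t1) < 20 or len(t2) < 20) and gap > 30
    return edit_distance(t1.lower(), t2.lower()) < budget and not short_vs_long


# Hypothetical sample titles.
same_but_case = titles_similar("More Words on Words", "More words on words")
long_prefix = titles_similar(
    "A survey of graph neural networks",
    "A survey of graph neural networks: methods and applications",
)
short_prefix = titles_similar(
    "Deep learning",
    "Deep learning for medical image segmentation: a comprehensive survey",
)
unrelated = titles_similar("Graph neural networks", "Quantum error correction")
```

Under these assumptions, the case-only difference and the long prefix are accepted, while the very short prefix and the unrelated pair are rejected. The short prefix fails only because of the second condition, which guards against a short generic title matching the start of a much longer one.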

The data is loaded from the CORE batch directory and saved as a single JSON file.

In [7]:
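print(datetime.now())
all_data = []
for path in input_filepaths:
    print(basename(path))
    rdd = sc.textFile(f"file://{getcwd()}/{path}")
    data = rdd.map(lambda line: id_to_auth_doi(loads(line)))
    # extend() appends in place instead of copying the whole list for every batch
    all_data.extend(data.collect())
    print(datetime.now())
# Make sure the output directory exists and close the file after writing.
makedirs("results/mag_match", exist_ok=True)
with open("results/mag_match/matching_data_complete.json", "w") as out:
    dump({"all_data": all_data}, out)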

2021-04-08 19:28:22.625175
001
2021-04-08 19:28:56.241732
002
2021-04-08 19:29:10.501211
003
2021-04-08 19:29:32.244258
004
2021-04-08 19:30:10.262153
005
2021-04-08 19:30:34.274285
006
2021-04-08 19:31:10.066201
007
2021-04-08 19:31:40.087070
008
2021-04-08 19:31:41.621784
009
2021-04-08 19:31:43.162061
010
2021-04-08 19:32:02.766443
011
2021-04-08 19:32:34.910860
012
2021-04-08 19:33:05.096864
013
2021-04-08 19:33:34.013624
014
2021-04-08 19:34:15.324806
015
2021-04-08 19:34:30.124443
016
2021-04-08 19:34:41.912931
017
2021-04-08 19:35:21.373632
018
2021-04-08 19:35:38.001168
019
2021-04-08 19:36:06.515262
020
2021-04-08 19:36:08.370141
021
2021-04-08 19:36:32.407680
022
2021-04-08 19:36:55.727786
023
2021-04-08 19:37:50.434034
024
2021-04-08 19:37:58.307383
025
2021-04-08 19:38:30.837643
026
2021-04-08 19:38:33.640238
027
2021-04-08 19:39:07.170664
028
2021-04-08 19:39:19.527213
029
2021-04-08 19:40:30.075409
030
2021-04-08 19:41:08.535386
031
2021-04-08 19:41:23.441667
032
2021-04-

In [2]:
from json import load
all_data = load(open("results/mag_match/matching_data_complete.json","r"))['all_data']

A dictionary keyed by DOI is created for the CORE set, so that all entries sharing a DOI can be accessed directly. The same is done for titles: all entries with a given title are stored in a dictionary with the title as key.

For fast membership tests, a set of all DOIs and a set of all titles are created as well.

In [3]:
doi_dict = {}
no_doi = 0
invalid_doi = 0
# DOIs are stored without spaces; entries without a DOI are left out.
doi_set = {entry[3].replace(' ', '') for entry in all_data if entry[3]}

# Entries without a title are left out (no KeyError if there is no None to remove).
title_set = {entry[1] for entry in all_data if entry[1] is not None}
auth_dict = {}

In [4]:
"10.1016/j.jcmg.2011.07.003" in doi_set

True

In [5]:
query = ("10.1016/j.jcmg.2011.07.003","10.1016/j.jacc.2010.07.010","10.1016/j.jacc.2004.02.025")
for entry in all_data:
    if entry[3] in query:
        print(entry)

['82716629', 'Surgery, angioplasty, or medical therapy for symptomatic multivessel coronary artery disease Is there an indisputable “winning strategy” from evidence-based clinical trials?**Editorials published in the Journal of the American College of Cardiologyreflect the views of the authors and do not necessarily represent the views of JACCor the American College of Cardiology.', ['Boden, William E'], '10.1016/j.jacc.2004.02.025', 2004]
['82328339', 'Ranolazine and Its Anti-Ischemic Effects Revisiting an Old Mechanistic Paradigm Anew?⁎⁎Editorials published in the Journal of the American College of Cardiologyreflect the views of the authors and do not necessarily represent the views of JACCor the American College of Cardiology.', ['Boden, William E.'], '10.1016/j.jacc.2010.07.010', 2010]
['82348440', 'Is Myocardial Perfusion Imaging an Important Predictor of Mortality in Women And if So, Is This Likely Cost Effective?⁎⁎Editorials published in JACC: Cardiovascular Imaging reflect the 

In [6]:
from re import match

no_title = 0
for entry in all_data:
    # Group entries by title; auth_dict is used later to compare years and authors.
    if not entry[1]:
        no_title += 1
    else:
        auth_dict.setdefault(entry[1], []).append(entry)

    if not entry[3]:
        no_doi += 1
        continue
    # Count DOIs that do not fully match the expected pattern.
    # match() returns None when the pattern does not apply at all.
    m = match(r"10.\d{4,9}/[-\._;\(\)/:a-zA-Z0-9]+ *", entry[3])
    if m is None or m.group() != entry[3]:
        invalid_doi += 1
    # Group entries by DOI, using the same space-free form as doi_set
    # so that every DOI in doi_set is also a key of doi_dict.
    doi_dict.setdefault(entry[3].replace(' ', ''), []).append(entry)

print("Len all: ",len(all_data))
print("Len title_set: ",len(title_set))
print("Len doi_set: ",len(doi_set))
print(f"No title: {no_title}")
print(f"no DOI: {no_doi}")
print(f"invalid DOI: {invalid_doi}")

Len all:  6531442
Len title_set:  5746594
Len doi_set:  2282164
No title: 7518
no DOI: 3902844
invalid DOI: 37
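
The index construction above can be sketched on a few toy records. This is a minimal illustration, assuming the tuple layout returned by `id_to_auth_doi`; the records, the DOIs, and the names `by_doi`, `by_title`, `doi_keys` and `title_keys` are hypothetical. DOIs are keyed without spaces here so that the dictionary keys agree with the set used for membership tests.

```python
from collections import defaultdict

# Toy records in the notebook's layout: (coreId, title, authors, doi, year).
records = [
    ("1", "Graph Matching", ["Doe, Jane"], "10.1000/abc 123", 2019),
    ("2", "Graph Matching", ["Roe, Richard"], None, 2020),
    ("3", None, ["Poe, Edgar"], "10.1000/abc123", 2019),
]

by_doi = defaultdict(list)    # space-free DOI -> all records with that DOI
by_title = defaultdict(list)  # exact title    -> all records with that title
for record in records:
    title, doi = record[1], record[3]
    if title:
        by_title[title].append(record)
    if doi:
        by_doi[doi.replace(" ", "")].append(record)

# Sets of the keys, used for the membership tests during matching.
doi_keys = set(by_doi)
title_keys = set(by_title)
```

Each key maps to a list because several CORE entries can share a title or a DOI, which is why the matching step later iterates over all candidates for a key.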


Now we go through all 208,915,369 entries of the Microsoft Academic Graph.
If the title or DOI of an entry is present in the respective set created above, the MAG entry might be a match for one of our CORE entries. Only entries with **exactly** the same title or DOI are considered as candidates.

If a potential match is found, all CORE entries with the matching title or DOI are accessed via the dictionaries created before.
For candidates found via DOI, the match is confirmed if the titles of the two entries are also considered similar by the `levenshtein_compare` function defined above.

For candidates found via title, the match is confirmed if both entries were published in the same year and at least one author can be found in both entries. The title is only consulted if the DOI of the MAG entry is not in the CORE DOI set.

If a match is confirmed, the IDs of both entries are added to a list of matches and the MAG entry is written to an output file.
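
The decision procedure described above can be sketched as a small function on toy data. This is a simplified illustration, not the code that produced the results: `find_match`, `shares_author`, the stand-in similarity function, and all records are hypothetical, and CORE entries use the tuple layout `(coreId, title, authors, doi, year)`.

```python
def shares_author(mag_authors, core_authors):
    """True if a first or last name token of any MAG author occurs in a CORE author string."""
    for author in mag_authors:
        tokens = author["name"].lower().split()
        if not tokens:
            continue
        first, last = tokens[0], tokens[-1]
        for name in core_authors or []:
            if isinstance(name, str) and (first in name.lower() or last in name.lower()):
                return True
    return False


def find_match(mag, by_doi, by_title, similar):
    """Return (coreId, route) for the first confirmed match, else None."""
    doi = mag.get("doi")
    if doi in by_doi:
        for core in by_doi[doi]:
            if similar(mag.get("title"), core[1]):
                return core[0], "doi"
        return None  # the title route is only tried when the DOI is unknown to CORE
    title = mag.get("title")
    if title in by_title and "year" in mag:
        for core in by_title[title]:
            if mag["year"] == core[4] and shares_author(mag.get("authors", []), core[2]):
                return core[0], "title"
    return None


# Stand-in for levenshtein_compare: case-insensitive equality only.
def similar(a, b):
    return bool(a) and bool(b) and a.lower() == b.lower()


core_a = ("11", "More words on words", ["Smith, Anna"], "10.1000/x1", 2013)
core_b = ("22", "Graph Matching", ["Doe, Jane"], None, 2019)
by_doi = {"10.1000/x1": [core_a]}
by_title = {"Graph Matching": [core_b]}

via_doi = find_match({"doi": "10.1000/x1", "title": "More Words on Words"},
                     by_doi, by_title, similar)
via_title = find_match({"title": "Graph Matching", "year": 2019,
                        "authors": [{"name": "Jane Doe"}]},
                       by_doi, by_title, similar)
wrong_year = find_match({"title": "Graph Matching", "year": 2018,
                         "authors": [{"name": "Jane Doe"}]},
                        by_doi, by_title, similar)
```

Note that the author check is deliberately loose: a single shared first or last name token is enough, so common names can confirm a match that a stricter comparison would reject.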

In [3]:
oag_path = "/home/jovyan/mnt/ceph/storage/corpora/corpora-thirdparty/corpus-microsoft-open-academic-graph/"

In [10]:
from os.path import join
output_path = "mag_sets/mag_in_core2"
matches = []

with open(output_path, "w+") as output_file:
    doimatch = 0
    titlematch = 0
    titlematch_no_auth = 0
    print(datetime.now())
    for j in range(11):
        print(j)
        with open(join(oag_path, "mag_papers_" + str(j) +  ".txt"), "r") as f:
            for i, line in enumerate(f):
                if i%10000000 == 0: print(i)
                myjson = loads(line)
                if 'doi' in myjson and myjson['doi'] in doi_set:
                    for entry in doi_dict[myjson['doi']]:
                        if myjson['title'] == entry[1] or levenshtein_compare(myjson['title'], entry[1]):
                            dump(myjson, output_file)
                            output_file.write("\n")
                            doimatch += 1
                            matches.append((myjson['id'],entry[0]))
                            break         
                elif 'title' in myjson and myjson['title'] in title_set:
                    if 'year' in myjson:
                        for entry in auth_dict[myjson['title']]:
                            if myjson['year'] == entry[4] and 'authors' in myjson and len(myjson['authors']) > 0:
                                for authn in myjson['authors']:
                                    auth = authn['name'].lower().split()
                                    lname = auth[-1]
                                    fname = auth[0]
                                    if any(isinstance(n, str) and (lname in n.lower() or fname in n.lower()) for n in entry[2]):
                                        dump(myjson, output_file)
                                        output_file.write("\n")
                                        titlematch += 1
                                        matches.append((myjson['id'],entry[0]))
                                        break
                                else:
                                    continue
                                break
                                          
dump({"matches":matches},open("results/mag_match/matches.json","w+"))
print(datetime.now())                           
print(f"Matched by title, year and at least one author: {titlematch}")
print(f"Matched by DOI and title: {doimatch}")

2021-04-09 08:27:01.168971
0
0
10000000
20000000
1
0
10000000
2
0
10000000
3
0
10000000
4
0
10000000
20000000
5
0
10000000
20000000
6
0
10000000
7
0
10000000
20000000
8
0
10000000
20000000
9
0
10000000
20000000
10
0


FileNotFoundError: [Errno 2] No such file or directory: 'results/mag_match/matches.json'

In [4]:
from os.path import join

i = 0
for j in range(11):
    print(j)
    with open(join(oag_path, "mag_papers_" + str(j) +  ".txt"), "r") as f:
        for line in f:
            i+=1
print(f'Total of entries in OAG: {i}')

0
1
2
3
4
5
6
7
8
9
10
Total of entries in OAG: 208915369


The matches found are written to a file for later use in the merging process.

In [11]:
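from os import makedirs
# Create the output directory if needed; a missing directory raises FileNotFoundError on open().
makedirs("results/mag_match", exist_ok=True)
with open("results/mag_match/matches.json", "w") as out:
    dump({"matches": matches}, out)
print(datetime.now())
print(f"Matched by title, year and at least one author: {titlematch}")
print(f"Matched by DOI and title: {doimatch}")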

2021-04-09 10:16:13.416659
Matched by title, year and at least one author: 1486254
Matched by DOI and title: 2080977


## Results check
We run some analyses on the matches and check for potential problems.
In particular, we look at how many entries have more than one match and check specific entries that caused problems in earlier runs of the matching.

In [12]:
from json import load
matches = load(open("results/mag_match/matches.json","r"))['matches']

In [13]:
matches[0]

('1000000185', '42023535')

In [14]:
mag_ids = [x[0] for x in matches]
core_ids = [x[1] for x in matches]
print(len(mag_ids)," ",len(core_ids))

3567231   3567231


In [15]:
print(len(set(mag_ids))," ",len(set(core_ids)))

3567231   3508509


In [16]:
import pandas as pd

match_frame = pd.DataFrame({"mag":mag_ids,"core":core_ids})

In [17]:
y = match_frame['core'].value_counts()
y[y > 1]

81276629    14
81073403    12
82262567    12
81789836    11
81778558    11
            ..
38931941     2
78103392     2
12025884     2
2094021      2
11470177     2
Name: core, Length: 50343, dtype: int64

In [18]:
x = match_frame['mag'].value_counts()
x[x > 1]

Series([], Name: mag, dtype: int64)
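
The same duplicate check can be sketched without pandas. This is a minimal illustration on hypothetical `(mag_id, core_id)` pairs, using `collections.Counter` in place of `value_counts`; the names `pairs`, `multi` and `surplus` are not part of the notebook.

```python
from collections import Counter

# Hypothetical (mag_id, core_id) pairs in the same layout as `matches`.
pairs = [("m1", "c1"), ("m2", "c1"), ("m3", "c2"),
         ("m4", "c3"), ("m5", "c3"), ("m6", "c3")]

# How often each CORE id was matched.
core_counts = Counter(core_id for _, core_id in pairs)

# CORE ids that received more than one MAG match.
multi = {core_id: n for core_id, n in core_counts.items() if n > 1}

# Matches beyond the first one per CORE id.
surplus = len(pairs) - len(core_counts)
```

Applied to the numbers above, the surplus is 3567231 - 3508509 = 58722 matches, spread over the 50343 CORE ids that have more than one MAG match.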

In [19]:
for x in query:
    print(x in mag_ids)

False
False
False


In [20]:
for x in ['939130355','2198205421','2207969244','2208114951']:
    print(x in mag_ids)

True
True
True
True


In [24]:
with open(output_path) as f:
    for i,line in enumerate(f):
        j = loads(line)
        if j['id'] == '939130355':
            result = j
            break
print(datetime.now())

2021-04-09 10:41:06.685782


In [26]:
our_doi = '10.5235/20504721.1.1.15'
if 'doi' in result and result['doi'] == our_doi:
    for entry in doi_dict[result['doi']]:
        if result['title'] == entry[1] or levenshtein_compare(result['title'], entry[1]):
            print("Matched by title: ",result['title'], entry[1])
            break         
elif 'title' in result and result['title'] in title_set:
    if 'year' in result:
        for entry in auth_dict[result['title']]:
            # The author list comes from the MAG record (result), as in the matching loop above.
            if result['year'] == entry[4] and 'authors' in result and len(result['authors']) > 0:
                for authn in result['authors']:
                    auth = authn['name'].lower().split()
                    lname = auth[-1]
                    fname = auth[0]
                    if any(isinstance(n, str) and (lname in n.lower() or fname in n.lower()) for n in entry[2]):
                        print("Titlematch")
                        break
                else:
                    # No author matched for this CORE entry; try the next one.
                    continue
                break

Matched by title:  More Words on Words More words on words
