## Author merging notebook

This notebook was used to check the quality of the matches generated with mag_matching.ipynb.  
Matches were generated mainly from DOIs, titles and publication years. As a minimum additional requirement, at least one author had to be found in both entries. This notebook checks whether this has led to accidental incorrect matches.

In [3]:
from glob import glob
from json import loads, dump, load
from os.path import basename, exists, isfile, isdir, sep
from os import makedirs, listdir, getcwd
from datetime import datetime
import pandas as pd

Set input paths for the CORE and MAG set and load a list of all matches created by the mag_matching.ipynb notebook.

In [1]:
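mag_path = "mag_sets/mag_in_core"
core_path = "core_sets/mag_matched/2021_04_09_12_08_34/batches"
# Load the list of matches and close the file handle afterwards
with open("results/mag_match/matches.json", "r") as f:
    mlist = load(f)['matches']
# Collect the CORE batch files; the list stays empty if the directory is missing
core_filepaths = sorted(glob(core_path + sep + "*")) if isdir(core_path) else []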

Define a function to load the id, authors and title of each matched entry. Start a Spark context, then apply the function to all matched entries and store the results in separate lists.

In [3]:
def id_auth(line):
    j = loads(line)
    if 'id' in j:
        return (j['id'],j['authors'],j['title'])
    else:
        return (j['coreId'],j['authors'],j['title'])
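
As an illustration of the two record shapes `id_auth` handles, the sketch below applies it to two JSON lines, one keyed by `id` and one keyed by `coreId`. The sample records are made up for this example (real entries carry more fields, and the title is a placeholder); only the field names come from the function above.

```python
from json import dumps, loads

def id_auth(line):
    # Same logic as the cell above: accept either an 'id' or a 'coreId' key.
    j = loads(line)
    if 'id' in j:
        return (j['id'], j['authors'], j['title'])
    else:
        return (j['coreId'], j['authors'], j['title'])

# Hypothetical sample records, serialised the way one line of an input file would be.
mag_line = dumps({'id': '1000000185', 'title': 'Example title',
                  'authors': [{'name': 'Sanpei Usui', 'id': '2398577680'}]})
core_line = dumps({'coreId': '42023535', 'title': 'Example title',
                   'authors': ['Usui, Sanpei']})

mag_record = id_auth(mag_line)
core_record = id_auth(core_line)
print(mag_record)
print(core_record)
```

Note that a record lacking `authors` or `title` would raise a `KeyError` in this function.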

In [2]:
from pyspark import SparkContext
sc = SparkContext(master = 'local[8]')

In [4]:
print(datetime.now())
mag_auth = []
with open(mag_path) as f:
    for i,line in enumerate(f):
        if i%1000000 == 0:
            print(i,": ",datetime.now())
        mag_auth.append(id_auth(line))
print(datetime.now())

2021-04-12 12:40:07.937312
0 :  2021-04-12 12:40:07.961380
1000000 :  2021-04-12 12:40:31.613487
2000000 :  2021-04-12 12:41:00.158131
3000000 :  2021-04-12 12:41:28.512934
2021-04-12 12:41:41.564202


In [6]:
print(datetime.now())
core_auth = []
for path in core_filepaths:
    print(basename(path))
    rdd = sc.textFile(f"file://{getcwd()}/{path}")
    data = rdd.map(id_auth)
    core_auth.extend(data.collect())
    print(datetime.now())

2021-04-12 12:41:46.448253
001
2021-04-12 12:41:58.727182
002
2021-04-12 12:42:04.537347
003
2021-04-12 12:42:12.322149
004
2021-04-12 12:42:22.204538
005
2021-04-12 12:42:27.988688
006
2021-04-12 12:42:45.112159
007
2021-04-12 12:42:57.964316
008
2021-04-12 12:42:58.785696
009
2021-04-12 12:42:59.509462
010
2021-04-12 12:43:10.531643
011
2021-04-12 12:43:27.629680
012
2021-04-12 12:43:40.925310
013
2021-04-12 12:43:58.820433
014
2021-04-12 12:44:23.308718
015
2021-04-12 12:44:31.234184
016
2021-04-12 12:44:35.544872
017
2021-04-12 12:44:55.607287
018
2021-04-12 12:45:05.365178
019
2021-04-12 12:45:20.854344
020
2021-04-12 12:45:21.099376
021
2021-04-12 12:45:42.881556
022
2021-04-12 12:45:54.428403
023
2021-04-12 12:46:22.552479
024
2021-04-12 12:46:27.118037
025
2021-04-12 12:46:45.492848
026
2021-04-12 12:46:46.807423
027
2021-04-12 12:46:55.743782
028
2021-04-12 12:47:02.748067
029
2021-04-12 12:48:03.185817
031
2021-04-12 12:48:10.855187
032
2021-04-12 12:48:27.103498
033
2021-04-

Create a pandas dataframe for the MAG and for the CORE entries, each with a column for id, authors and title.

In [5]:
import pandas as pd

mag_auth = pd.DataFrame(mag_auth, columns =['Mag_id', 'Mag_Authors','Mag_Title'])

In [None]:
core_auth = pd.DataFrame(core_auth, columns =['Core_id', 'Core_Authors','Core_Title'])

Merge both dataframes based on the list of entry matches. Each match is now a row in the dataframe, containing the author lists and titles of both the MAG and the CORE entry.

In [8]:
auth_df = pd.DataFrame(mlist,columns =['Mag_id','Core_id'])
auth_df = auth_df.merge(mag_auth,on=['Mag_id']).merge(core_auth,on=['Core_id'])
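
Both `merge` calls use the pandas default, an inner join, so a match whose id is missing from `mag_auth` or `core_auth` drops out of `auth_df` without a warning. A minimal plain-Python sketch of a check that could be run before merging follows. The ids and variable names are invented for illustration, and the sketch assumes ids are stored with the same type on both sides.

```python
# Hypothetical stand-ins for mlist and for the id columns of mag_auth / core_auth.
matches = [('m1', 'c1'), ('m2', 'c2'), ('m3', 'c3')]
loaded_mag_ids = {'m1', 'm2'}            # 'm3' was never loaded
loaded_core_ids = {'c1', 'c2', 'c3'}

# A match survives an inner join only if both of its ids were loaded.
kept = [(m, c) for m, c in matches if m in loaded_mag_ids and c in loaded_core_ids]
dropped = [pair for pair in matches if pair not in kept]

print(f"{len(kept)} of {len(matches)} matches survive; dropped: {dropped}")
```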

In [9]:
auth_df

Unnamed: 0,Mag_id,Core_id,Mag_Authors,Mag_Title,Core_Authors,Core_Title
0,1000000185,42023535,"[{'name': 'Sanpei Usui', 'id': '2398577680'}]",Complex structures on partial compactification...,"[Usui, Sanpei]",Complex structures on partial compactification...
1,100000407,82235102,"[{'name': 'Surender Deora', 'id': '2132854008'...",Cardiac Epithelioid Leiomyosarcoma as Both Int...,"[Deora, Surender, Gurmukhani, Sunil, Shah, San...",Cardiac Epithelioid Leiomyosarcoma as Both Int...
2,1000007777,9554974,"[{'name': 'Shengye Zeng', 'id': '2163875822'},...",The investigation of material removal in bonne...,"[Zeng, Shengye, Blunt, Liam, Jiang, Xiang]",The investigation of material removal in bonne...
3,1000013286,36706632,"[{'name': 'Clifford Allison. Rose', 'id': '251...",Time multiplexing of compensation in control s...,"[Rose, Clifford Allison.]",Time multiplexing of compensation in control s...
4,1000014399,6112882,"[{'name': 'Nicky Stanley', 'id': '2171225193'}...",Disclosing Disability: Disabled students and p...,"[Stanley, Nicky, Ridley, Julie, Harris, Jessic...",Disclosing Disability: Disabled students and p...
...,...,...,...,...,...,...
3567226,999972508,82484073,"[{'name': 'Rebecca A. Fischer', 'id': '2164566...",High pressure metal–silicate partitioning of N...,"[Fischer, Rebecca A., Nakajima, Yoichi, Campbe...",High pressure metal–silicate partitioning of N...
3567227,99997835,82317676,"[{'name': 'Alice K. Jacobs', 'id': '2807342118'}]","Women, ischemic heart disease, revascularizati...","[Jacobs, Alice K.]","Women, Ischemic Heart Disease, Revascularizati..."
3567228,999981119,18590244,"[{'name': 'Zhe Kan', 'id': '2649816162'}]",Electrical properties of carbon structures : c...,"[Kan, Zhe]",Electrical properties of carbon structures : c...
3567229,999981997,31012454,"[{'name': 'Georg Futter', 'id': '2124207548'},...",Modeling of transport and degradation processe...,"[Futter, Georg, Jahnke, Thomas, Latz, Arnulf]",Modeling of transport and degradation processe...


Create columns containing the lengths of the author lists. Differing list lengths may indicate incorrect matches.
Store the data so it can be accessed later without rebuilding the dataframes from scratch.

In [10]:
auth_df['Len_mag'] = [len(auth) for auth in auth_df['Mag_Authors']]
auth_df['Len_core'] = [len(auth) for auth in auth_df['Core_Authors']]

In [12]:
auth_df.to_json('results/mag_match/matching_checkup.json')

In [4]:
auth_df = pd.read_json('results/mag_match/matching_checkup.json')

Define a function to compare author lists. It takes a list of MAG authors a and a list of CORE authors b, and returns a tuple with four values:  

    The number of author pairs that match by either first or last name  
    The number of author pairs that match by last name  
    The number of author pairs that match by first name  
    The number of author pairs that match by both first and last name  
    
Each value counts (MAG author, CORE author) pairs found by substring search, so it can exceed the length of either list. The function is tested on an example list to see whether it returns the intended values.

In [34]:
def compare_l_f(a,b):
    m_l_f = 0
    m_l = 0
    m_f = 0
    m_l_f_both = 0
    for auth_m in a:
        lname = auth_m['name'].split()[-1].lower()
        fname = auth_m['name'].split()[0].lower()
        for n in b:
            if not isinstance(n,str):
                continue
            nl = n.lower()
            lhit = False
            fhit = False
            if nl.find(lname)>-1:
                m_l += 1
                lhit = True
            if nl.find(fname)>-1:
                m_f += 1
                fhit = True
            if lhit and fhit:
                m_l_f_both += 1
            if lhit or fhit:
                m_l_f += 1
            
    return (m_l_f, m_l, m_f, m_l_f_both)

In [35]:
compare_l_f([{'name': 'Anna Maria', 'id': '2398577680'},{'name': 'Julia Bauer', 'id': '2398577680'},{'name': 'Petra Müller', 'id': '2398577680'},{'name': 'Vivian Schneider', 'id': '2398577680'}],['Maria, Anna','Bauer, Greta','Tante, Petra','Usui, Sanpei'])

(3, 2, 2, 1)
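
Because `compare_l_f` increments its counters once per (MAG author, CORE author) pair, its values are not bounded by the number of authors. The following sketch is not part of the original notebook. It shows one hypothetical alternative, `count_matched_authors`, that counts each MAG author at most once, so the result never exceeds `len(mag_authors)`. It still relies on substring search of the last name.

```python
def count_matched_authors(mag_authors, core_authors):
    """Count MAG authors whose last name occurs in at least one CORE author string."""
    matched = 0
    for author in mag_authors:
        parts = author['name'].split()
        if not parts:
            continue  # skip empty names instead of raising IndexError
        lname = parts[-1].lower()
        # any() stops at the first hit, so each MAG author contributes at most 1
        if any(isinstance(c, str) and lname in c.lower() for c in core_authors):
            matched += 1
    return matched

# Same example lists as in the test cell above.
mag = [{'name': 'Anna Maria', 'id': '2398577680'},
       {'name': 'Julia Bauer', 'id': '2398577680'},
       {'name': 'Petra Müller', 'id': '2398577680'},
       {'name': 'Vivian Schneider', 'id': '2398577680'}]
core = ['Maria, Anna', 'Bauer, Greta', 'Tante, Petra', 'Usui, Sanpei']

n_matched = count_matched_authors(mag, core)
print(n_matched)
```

Here only "Maria" and "Bauer" are found as last names, so the count is 2 out of 4 MAG authors.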

Import the module for Levenshtein string comparison.  
Define compare_lev, a function which compares the last name of each MAG author with every author string in the CORE list using the Levenshtein distance. It returns the number of MAG authors for whom a sufficiently similar name was found in the CORE list.

In [18]:
from Levenshtein import distance as levenshtein_distance
def levenshtein_compare(t1, t2):
    if not t1 or not t2:
        return False
    return (levenshtein_distance(t1.lower(), t2.lower()) < abs(len(t1)-len(t2)) + 0.1*min(len(t1),len(t2)) and not ((len(t1) < 20 or len(t2) < 20) and abs(len(t1)-len(t2)) > 30))
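
To make the threshold in `levenshtein_compare` concrete: the edit distance has to stay below the length difference plus 10% of the shorter string's length. The distance between two strings is never smaller than their length difference, so when the shorter string has 10 characters or fewer, the rule tolerates no substitution at all. The sketch below checks this on three name pairs taken from the tables in this notebook. It uses a small pure-Python edit distance (`edit_distance`, a stand-in for `Levenshtein.distance`) so that it runs without the extra package. The third case assumes the CORE string stores `İ` as the single code point U+0130.

```python
def edit_distance(s, t):
    """Plain dynamic-programming Levenshtein distance (stand-in for Levenshtein.distance)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution or match
        prev = cur
    return prev[-1]

def levenshtein_compare(t1, t2):
    # Same rule as the cell above, with edit_distance swapped in.
    if not t1 or not t2:
        return False
    return (edit_distance(t1.lower(), t2.lower()) < abs(len(t1) - len(t2)) + 0.1 * min(len(t1), len(t2))
            and not ((len(t1) < 20 or len(t2) < 20) and abs(len(t1) - len(t2)) > 30))

cases = [
    ('Benavent Climent, Amadeo', 'Benavent-Climent'),  # 16-char name: one substitution fits the margin
    ('Sujova, Andrea', 'Sujov\u00e1'),                 # 6-char name: the accent alone breaks the match
    ('Bircan, \u0130smail', 'Bircan'),                 # lower() turns U+0130 into two code points
]
results = [levenshtein_compare(core_a, mag_last) for core_a, mag_last in cases]
for (core_a, mag_last), ok in zip(cases, results):
    print(f"{core_a!r} vs {mag_last!r}: {ok}")
```

In the third case the lengths are measured before lower-casing but the distance after it, which makes the distance one larger than the threshold allows. This is consistent with row 152 further down (`Known_l` = 1, `Known_lev` = 0), although whether it is the actual cause depends on how the character is encoded in the data.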

In [19]:
def compare_lev(mag, core):
    # Count MAG authors whose last name is Levenshtein-similar to at least one CORE author string
    return sum(any(isinstance(core_a, str) and levenshtein_compare(core_a, mag_a['name'].split()[-1])
                   for core_a in core) for mag_a in mag)

Use compare_l_f to count the matches by first name, by last name, by both, and by either of the two, and store each count in its own column.

In [36]:
auth_df['Known_comb'] = [compare_l_f(mag,core) for mag, core in zip(auth_df['Mag_Authors'],auth_df['Core_Authors'])]

In [37]:
auth_df['Known_l'] = [x[1] for x in auth_df['Known_comb']]
auth_df['Known_f'] = [x[2] for x in auth_df['Known_comb']]
auth_df['Known_l_f'] = [x[0] for x in auth_df['Known_comb']]
auth_df['Known_l+f'] = [x[3] for x in auth_df['Known_comb']]

In [12]:
auth_df

Unnamed: 0,Mag_id,Core_id,Mag_Authors,Mag_Title,Core_Authors,Core_Title,Len_mag,Len_core,Known_l,Known_f,Known_l_f,Known_l+f
0,1000000185,42023535,"[{'name': 'Sanpei Usui', 'id': '2398577680'}]",Complex structures on partial compactification...,"[Usui, Sanpei]",Complex structures on partial compactification...,1,1,1,1,1,1
1,100000407,82235102,"[{'name': 'Surender Deora', 'id': '2132854008'...",Cardiac Epithelioid Leiomyosarcoma as Both Int...,"[Deora, Surender, Gurmukhani, Sunil, Shah, San...",Cardiac Epithelioid Leiomyosarcoma as Both Int...,6,6,8,6,8,6
2,1000007777,9554974,"[{'name': 'Shengye Zeng', 'id': '2163875822'},...",The investigation of material removal in bonne...,"[Zeng, Shengye, Blunt, Liam, Jiang, Xiang]",The investigation of material removal in bonne...,3,3,3,3,3,3
3,1000013286,36706632,"[{'name': 'Clifford Allison. Rose', 'id': '251...",Time multiplexing of compensation in control s...,"[Rose, Clifford Allison.]",Time multiplexing of compensation in control s...,1,1,1,1,1,1
4,1000014399,6112882,"[{'name': 'Nicky Stanley', 'id': '2171225193'}...",Disclosing Disability: Disabled students and p...,"[Stanley, Nicky, Ridley, Julie, Harris, Jessic...",Disclosing Disability: Disabled students and p...,5,5,5,5,5,5
...,...,...,...,...,...,...,...,...,...,...,...,...
3567226,999972508,82484073,"[{'name': 'Rebecca A. Fischer', 'id': '2164566...",High pressure metal–silicate partitioning of N...,"[Fischer, Rebecca A., Nakajima, Yoichi, Campbe...",High pressure metal–silicate partitioning of N...,9,9,9,9,9,9
3567227,99997835,82317676,"[{'name': 'Alice K. Jacobs', 'id': '2807342118'}]","Women, ischemic heart disease, revascularizati...","[Jacobs, Alice K.]","Women, Ischemic Heart Disease, Revascularizati...",1,1,1,1,1,1
3567228,999981119,18590244,"[{'name': 'Zhe Kan', 'id': '2649816162'}]",Electrical properties of carbon structures : c...,"[Kan, Zhe]",Electrical properties of carbon structures : c...,1,1,1,1,1,1
3567229,999981997,31012454,"[{'name': 'Georg Futter', 'id': '2124207548'},...",Modeling of transport and degradation processe...,"[Futter, Georg, Jahnke, Thomas, Latz, Arnulf]",Modeling of transport and degradation processe...,3,3,3,3,3,3


In [None]:
auth_df = auth_df.drop(['Known_comb'],axis=1)

Now we look for potential mismatches. We start by looking at the entries where CORE contains more authors than MAG.

In [13]:
auth_df[auth_df.Len_mag < auth_df.Len_core]

Unnamed: 0,Mag_id,Core_id,Mag_Authors,Mag_Title,Core_Authors,Core_Title,Len_mag,Len_core,Known_l,Known_f,Known_l_f,Known_l+f
8,100002650,143745708,"[{'name': 'Linda Jean. Koska', 'id': '66207075...","THE STRUCTURE AND HARMONIC LANGUAGE OF ""THE DO...","[Koska, Linda Jean., Koska, Linda Jean.]","THE STRUCTURE AND HARMONIC LANGUAGE OF ""THE DO...",1,2,2,2,2,2
23,100008419,11795336,"[{'name': 'Nizam Dahalan', 'id': '2337882912'}]",Development of a computer program for rocket a...,"[Dahalan, Md. Nizam, Su, Vin Cent, Ammoo, Mohd...",Development of a computer program for rocket a...,1,3,1,1,1,1
43,100018775,143746796,"[{'name': 'Tein-Yow Yu', 'id': '849224505'}]",Efficient backtracking strategies in test gene...,"[Yu, Tein-Yow, 1961-, Yu, Tein-Yow, 1961-]",Efficient backtracking strategies in test gene...,1,2,2,2,2,2
48,2247017209,36020396,"[{'name': 'T. Aaltonen', 'id': '2551340074'}]",Measurement of the p anti-p to t anti-t Produc...,"[Aaltonen, T., Casal, Bruno, Cuevas, Javier, G...",Measurement of the pp̅ →tt̅ production cross s...,1,13,1,1,1,1
71,1000276684,143798957,"[{'name': 'Janet Lynn Wagner', 'id': '26571556...",Characterization of the detergent sodium chola...,"[Wagner, Janet Lynn, Wagner, Janet Lynn]",Characterization of the detergent sodium chola...,1,2,2,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...
3567109,99960701,143747665,"[{'name': 'Stephen Marcus Green', 'id': '20987...",Manipulation of a rectangular air jet using pi...,"[Green, Stephen Marcus, 1957-, Green, Stephen ...",Manipulation of a rectangular air jet using pi...,1,2,2,2,2,2
3567139,99969003,143751704,"[{'name': 'Deborah Ann Coughlin', 'id': '21239...",Fitting the school to the child: A case histor...,"[Coughlin, Deborah Ann, 1953-, Coughlin, Debor...",Fitting the school to the child: A case histor...,1,2,2,2,2,2
3567159,99975286,33012342,"[{'name': 'N N Pillai', 'id': '2636378198'}, {...",Brackishwater prawn farming in the Ashtamudi l...,"[Nair, P V Ramachandran, Pillai, N N, Pillai, ...",Brackishwater prawn farming in the Ashtamudi l...,7,8,13,28,32,9
3567204,999908821,191028,"[{'name': 'Eilean Hooper-Greenhill', 'id': '22...",Developing a scheme for finding evidence of th...,"[Hooper-Greenhill, Eilean, Resource]",Developing a scheme for finding evidence of th...,1,2,1,1,1,1


In [None]:
auth_df[auth_df.Len_mag < auth_df.Len_core].sample(20)

The results are stored again, as the list comparisons take some time to complete.

In [31]:
auth_df.to_json('results/mag_match/matching_checkup.json')

We take a look at some entries with differing length in detail

In [28]:
auth_df.loc[3567139].Mag_Authors

[{'name': 'Deborah Ann Coughlin', 'id': '2123992730'}]

In [29]:
auth_df.loc[3567139].Core_Authors

['Coughlin, Deborah Ann, 1953-', 'Coughlin, Deborah Ann, 1953-']
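
In this row, and in several rows of the table above, `Len_core` is 2 only because the CORE record lists the same author string twice. A small order-preserving de-duplication would remove this source of length mismatches before `Len_core` is computed. It is sketched here as a hypothetical helper, `dedupe_authors`, that is not part of the original notebook.

```python
def dedupe_authors(core_authors):
    """Drop repeated author strings (ignoring case and outer whitespace), keeping order."""
    seen = set()
    unique = []
    for name in core_authors:
        if not isinstance(name, str):
            continue  # skip non-string entries, as compare_l_f does
        key = name.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(name)
    return unique

core_authors = ['Coughlin, Deborah Ann, 1953-', 'Coughlin, Deborah Ann, 1953-']
deduped = dedupe_authors(core_authors)
print(deduped, len(deduped))
```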

In [26]:
auth_df.loc[3567159].Mag_Authors

[{'name': 'N N Pillai', 'id': '2636378198'},
 {'name': 'V K Pillai', 'id': '2699148515'},
 {'name': 'P P Pillai', 'id': '2645088690'},
 {'name': 'K J Mathew', 'id': '2656141606'},
 {'name': 'C P Gopinathan', 'id': '2797365781'},
 {'name': 'V K Balachandran', 'id': '2636299615'},
 {'name': 'D Vincent', 'id': '2674116733'}]

In [27]:
auth_df.loc[3567159].Core_Authors

['Nair, P V Ramachandran',
 'Pillai, N N',
 'Pillai, V K',
 'Pillai, P P',
 'Mathew, K J',
 'Gopinathan, C P',
 'Balachandran, V K',
 'Vincent, D']
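
The counts for row 3567159 (`Known_l` = 13, `Known_f` = 28, `Known_l_f` = 32, `Known_l+f` = 9) are far larger than the seven MAG and eight CORE authors. The sketch below re-runs `compare_l_f`, copied from the cell above, on the two lists just printed. The shared surname is counted once per pair, and the single-letter first names match as substrings of many CORE strings.

```python
def compare_l_f(a, b):
    # Copied from the notebook cell above: counts (MAG author, CORE author) pairs.
    m_l_f = 0
    m_l = 0
    m_f = 0
    m_l_f_both = 0
    for auth_m in a:
        lname = auth_m['name'].split()[-1].lower()
        fname = auth_m['name'].split()[0].lower()
        for n in b:
            if not isinstance(n, str):
                continue
            nl = n.lower()
            lhit = nl.find(lname) > -1
            fhit = nl.find(fname) > -1
            m_l += lhit
            m_f += fhit
            m_l_f_both += lhit and fhit
            m_l_f += lhit or fhit
    return (m_l_f, m_l, m_f, m_l_f_both)

mag_authors = [{'name': 'N N Pillai'}, {'name': 'V K Pillai'}, {'name': 'P P Pillai'},
               {'name': 'K J Mathew'}, {'name': 'C P Gopinathan'},
               {'name': 'V K Balachandran'}, {'name': 'D Vincent'}]
core_authors = ['Nair, P V Ramachandran', 'Pillai, N N', 'Pillai, V K', 'Pillai, P P',
                'Mathew, K J', 'Gopinathan, C P', 'Balachandran, V K', 'Vincent, D']

counts = compare_l_f(mag_authors, core_authors)
print(counts)  # (either, last, first, both)
```

The three Pillai entries alone contribute 3 × 3 = 9 of the 13 last-name hits, which shows that large counts here reflect the counting scheme rather than better matches.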

In [16]:
auth_df.loc[296].Mag_Authors

[{'name': 'Asier Jayo', 'id': '1756440050'},
 {'name': 'Dina Pabón', 'id': '2168065855'},
 {'name': 'Pedro Lastres', 'id': '2196020418'},
 {'name': 'V. Jimenez-Yuste', 'id': '2109685147'},
 {'name': 'Consuelo González-Manchón', 'id': '1974979943'}]

In [17]:
auth_df.loc[296].Core_Authors

['Jayo, Asier',
 'Pabón, Dina',
 'Lastres, Pedro',
 'Jiménez-Yuste, Victor',
 'González-Manchón, Consuelo']
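
The two lists for row 296 differ in `V. Jimenez-Yuste` versus `Jiménez-Yuste, Victor`, so the accent alone prevents a last-name hit for that author. One possible remedy is to fold diacritics with the standard-library `unicodedata` module before comparing. The sketch below assumes that accent-insensitive matching is acceptable for this check. The helper name `fold` is hypothetical and not part of the original notebook.

```python
import unicodedata

def fold(text):
    """Lower-case text and strip combining marks (accents) after NFKD decomposition."""
    decomposed = unicodedata.normalize('NFKD', text.lower())
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

mag_last = 'Jimenez-Yuste'
core_author = 'Jim\u00e9nez-Yuste, Victor'  # 'Jiménez-Yuste, Victor'

plain_hit = mag_last.lower() in core_author.lower()
folded_hit = fold(mag_last) in fold(core_author)
print(plain_hit, folded_hit)
```

Letters that have no Unicode decomposition, such as `ø` or the dotless `ı`, pass through `fold` unchanged, so this does not cover every spelling variant.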

We take a look at the entries where not all authors could be matched and take some samples.

In [46]:
auth_df[(auth_df.Known_l_f < auth_df.Len_mag) & (auth_df.Known_l_f < auth_df.Len_core)]

Unnamed: 0,Mag_id,Core_id,Mag_Authors,Mag_Title,Core_Authors,Core_Title,Len_mag,Len_core,Known_l,Known_f,Known_l_f,Known_l+f
116,2005087574,19588467,"[{'name': 'A. Benavent-Climent', 'id': '173474...",Shaking table tests of structures with hystere...,"[Benavent Climent, Amadeo, Escolano Margarit, ...",Shaking table tests of structures with hystere...,2,2,0,1,1,0
296,1001191458,36108925,"[{'name': 'Asier Jayo', 'id': '1756440050'}, {...",Type II Glanzmann thrombasthenia in a compound...,"[Jayo, Asier, Pabón, Dina, Lastres, Pedro, Jim...",Type II Glanzmann thrombasthenia in a compound...,5,5,4,4,4,4
521,100208094,70286,"[{'name': 'Olga Mudraya', 'id': '2076331398'},...",English-Russian-Finnish cross-language compari...,"[Mudraya, O., Piao, S., Lofberg, L., Rayson, P...",English-Russian-Finnish cross-language compari...,5,5,4,0,4,0
657,1002592871,42931908,"[{'name': 'Gyula Katona', 'id': '2794350191'},...",Length of Sums in a Minkowski Space,"[Katona, Gyula, Mayer, R., Woyczynski, W. A.]",Length of Sums in a Minkowski Space,3,3,2,2,2,2
879,1003415266,94283594,"[{'name': 'Zhuravlev Yuriy Ivanovich', 'id': '...",The new approaches to problem of polymorbidity,"[Zhuravlev, Y. I., Tkhorikova, V. N.]",The new approaches to problem of polymorbidity,2,2,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
3566745,998132775,10924117,"[{'name': 'Bohumír Cagaš', 'id': '1703597667'}...",The effect of grasses grown for seed in mixtur...,"[Cagas, Bohumir, Machac, Radek]",The effect of grasses grown for seed in mixtur...,2,2,0,1,1,0
3566787,998279759,11857567,"[{'name': 'G. Modla', 'id': '1527749307'}, {'n...",Entrainer selection for pressure swing batch d...,"[Modla, Gábor, Láng, Péter, Kopasz, Árpád]",Entrainer selection for pressure swing batch d...,3,3,2,0,2,0
3567096,999549861,82522043,"[{'name': 'Andrea Sujová', 'id': '2310056901'}...",Modern Methods of Process Management Used in S...,"[Sujova, Andrea, Marcinekova, Katarina]",Modern Methods of Process Management Used in S...,2,2,0,1,1,0
3567148,999726932,36088314,"[{'name': 'Irene Gonzalez-Valls', 'id': '20322...",Aligned semiconductor oxide nanostructures for...,"[González-Valls, Irene, González-García, Lola,...",Aligned semiconductor oxide nanostructures for...,8,8,6,6,7,5


In [47]:
auth_df[(auth_df.Known_f < auth_df.Len_mag) & (auth_df.Known_f < auth_df.Len_core)]

Unnamed: 0,Mag_id,Core_id,Mag_Authors,Mag_Title,Core_Authors,Core_Title,Len_mag,Len_core,Known_l,Known_f,Known_l_f,Known_l+f
10,1000030475,43097361,"[{'name': 'Clara Kulich', 'id': '1971919205'},...",Signaling change during a crisis: refining con...,"[Kulich, C, Lorenzi-Cioldi, F, Iacoviello, V, ...",Signaling change during a crisis: Refining con...,5,5,5,0,5,0
13,1000056,37502359,"[{'name': 'Fabricia Petronilho', 'id': '205885...",Gastrin-Releasing Peptide Receptor Antagonism ...,"[Petronilho, Fabricia, Vuolo, Francieli, Galan...",Gastrin-Releasing Peptide Receptor Antagonism ...,23,22,23,21,25,19
17,1000064551,81137944,"[{'name': 'Daniel Menezes Silvestre', 'id': '2...",Feasibility study of calibration strategy for ...,"[Silvestre, Daniel Menezes, Barbosa, Felipe Mi...",Feasibility study of calibration strategy for ...,5,5,5,4,5,4
20,1000069329,55537363,"[{'name': 'Halszka Jarodzka', 'id': '302381991...",Cognitive Skills in Medicine,"[Jarodzka, Halszka, Boshuizen, Els, Kirschner,...",Cognitive Skills in Catheter-based Cardiovascu...,3,3,3,2,3,2
40,100017267,29258100,"[{'name': 'David Lentink', 'id': '2116162051'}]",Exploring the biofluiddynamics of swimming and...,"[Lentink, D.]",Exploring the biofluiddynamics of swimming and...,1,1,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
3567195,999883381,36047224,"[{'name': 'Pelayo González', 'id': '2175088314...",Functional polymorphism in the promoter region...,"[González, P., Díez-Juan, Antonio, Andrés, Vic...",Functional polymorphism in the promoter region...,5,5,4,4,5,3
3567196,999892512,81171909,"[{'name': 'Pranas Baltrenas', 'id': '226744247...",Experimental Research of Odours Arising During...,"[Baltrenas, Pranas, Misevicius, Antonas, Macai...",Experimental Research of Odours Arising During...,4,4,2,3,3,2
3567197,99989776,78248,"[{'name': 'Nicholas Ryder', 'id': '2097380707'}]",The finacial services authority and money laun...,"[Ryder, N.]",The finacial services authority and money laun...,1,1,1,0,1,0
3567210,999922911,48165576,"[{'name': 'Julien Dagher', 'id': '2078178394'}...",Wild-type VHL Clear Cell Renal Cell Carcinomas...,"[Dagher, Julien, Kammerer-Jacquet, Solene-Flor...",Wild-type VHL Clear Cell Renal Cell Carcinomas...,14,14,14,12,15,11


In [48]:
auth_df[(auth_df.Known_l < auth_df.Len_mag) & (auth_df.Known_l < auth_df.Len_core)]

Unnamed: 0,Mag_id,Core_id,Mag_Authors,Mag_Title,Core_Authors,Core_Title,Len_mag,Len_core,Known_l,Known_f,Known_l_f,Known_l+f
116,2005087574,19588467,"[{'name': 'A. Benavent-Climent', 'id': '173474...",Shaking table tests of structures with hystere...,"[Benavent Climent, Amadeo, Escolano Margarit, ...",Shaking table tests of structures with hystere...,2,2,0,1,1,0
199,100079032,12356592,"[{'name': 'O. O. Ajayi', 'id': '2470271829'}, ...",Assessment of Wind Power Potential and Wind El...,"[Ajayi, O. O., Fagbenle, R. O., Katende, J.]",Assessment of Wind Power Potential and Wind El...,3,3,2,3,3,2
235,1000955919,130019579,"[{'name': 'Anik Faujiah', 'id': '1995878208'}]",Aspek hukum pewarnaan rambut: studi kasus di k...,"[Faujiyah, Anik]",Aspek hukum pewarnaan rambut: studi kasus di k...,1,1,0,1,1,0
255,100103939,16388691,"[{'name': 'Ryota Nakajima', 'id': '2099235777'...",Sedimentation impacts on the growth rates of t...,"[R., Nakajima, T., Yoshida, Y., Fuchinove, T.,...",Sedimentation impacts on the growth rates of t...,8,8,7,9,13,3
296,1001191458,36108925,"[{'name': 'Asier Jayo', 'id': '1756440050'}, {...",Type II Glanzmann thrombasthenia in a compound...,"[Jayo, Asier, Pabón, Dina, Lastres, Pedro, Jim...",Type II Glanzmann thrombasthenia in a compound...,5,5,4,4,4,4
...,...,...,...,...,...,...,...,...,...,...,...,...
3567016,999271521,33667039,"[{'name': 'Saskia Brix', 'id': '2048068937'}, ...",IceAGE - Icelandic marine Animals: Genetics an...,"[Brix, Saskia, Martinez, Pedro, Svavarsson, Jö...",IceAGE - Icelandic marine Animals: Genetics an...,11,11,10,11,11,10
3567096,999549861,82522043,"[{'name': 'Andrea Sujová', 'id': '2310056901'}...",Modern Methods of Process Management Used in S...,"[Sujova, Andrea, Marcinekova, Katarina]",Modern Methods of Process Management Used in S...,2,2,0,1,1,0
3567148,999726932,36088314,"[{'name': 'Irene Gonzalez-Valls', 'id': '20322...",Aligned semiconductor oxide nanostructures for...,"[González-Valls, Irene, González-García, Lola,...",Aligned semiconductor oxide nanostructures for...,8,8,6,6,7,5
3567195,999883381,36047224,"[{'name': 'Pelayo González', 'id': '2175088314...",Functional polymorphism in the promoter region...,"[González, P., Díez-Juan, Antonio, Andrés, Vic...",Functional polymorphism in the promoter region...,5,5,4,4,5,3


In [50]:
s_m_ids = set(load(open('results/mag_match/ids_for_single_multi_core_extraction.json','r'))['ids'])
auth_df[(auth_df.Core_id.isin(s_m_ids)) & (auth_df.Known_l_f < auth_df.Len_mag) & (auth_df.Known_l_f < auth_df.Len_core)]

Unnamed: 0,Mag_id,Core_id,Mag_Authors,Mag_Title,Core_Authors,Core_Title,Len_mag,Len_core,Known_l,Known_f,Known_l_f,Known_l+f
521,100208094,70286,"[{'name': 'Olga Mudraya', 'id': '2076331398'},...",English-Russian-Finnish cross-language compari...,"[Mudraya, O., Piao, S., Lofberg, L., Rayson, P...",English-Russian-Finnish cross-language compari...,5,5,4,0,4,0
962,100373180,44234818,"[{'name': 'E. Adli', 'id': '2050692267'}, {'na...",Status of an automatic Beam Steering for the C...,"[Adli, E, Corsini, R, Dabrowski, A, Schulte, D...",Status of an automatic Beam Steering for the C...,8,8,6,1,6,1
1239,1004814664,82378655,"[{'name': 'Hulya Caliskan', 'id': '2133431902'}]",Technological Change and Economic Growth,"[Çalışkan, Hülya Kesici]",Technological Change and Economic Growth,1,1,0,0,0,0
3148,1930314859,39279954,"[{'name': 'ShikamaT.', 'id': '2249092368'}, {'...",Plasma polarization spectroscopy of atomic and...,"[Shikama, T., Fujii, K., Kado, S., Zushi, H., ...",Plasma polarization spectroscopy of atomic and...,9,9,0,0,0,0
3526,1013294644,82002019,"[{'name': 'Adrian Munguia-Vega', 'id': '545238...",Marine reserves help preserve genetic diversit...,"[Munguía-Vega, Adrián, Sáenz-Arroyo, Andrea, G...",Marine reserves help preserve genetic diversit...,7,7,6,6,6,6
...,...,...,...,...,...,...,...,...,...,...,...,...
3564943,991187291,48295270,"[{'name': 'Vincenzo De Maio', 'id': '215774425...",Modelling energy consumption of network transf...,"[De Maio, Vincenzo, Prodan, Radu, Benedict, Sh...",Modelling Energy Consumption of Network Transf...,4,4,3,3,3,3
3565686,99412176,11758582,"[{'name': 'S. N. Losa', 'id': '2613851832'}, {...",Estimating biological parameters of a size-dep...,"[Loza, Svetlana, Schröter, Jens, Wenzel, Manfr...",Estimating biological parameters of a size-dep...,4,5,3,0,3,0
3565835,994716540,70382584,"[{'name': 'Yael R. Nobel', 'id': '1977778394'}...",Metabolic and metagenomic outcomes from early-...,"[Abubucker, Sahar, Zhou, Yanjiao, Mitreva, Mak...",Metabolic and metagenomic outcomes from early-...,20,6,5,5,5,5
3565900,994976729,36015984,"[{'name': 'Jesús Ruiz Amaya', 'id': '276399584...",A 0.13μm CMOS Current Steering D/A Converter f...,"[Ruiz Amaya, Jesús, Fernández-Bootello, Juan F...",A 0.13μm CMOS Current Steering D/A Converter f...,4,4,3,3,3,3


In [54]:
auth_df[auth_df.Known_l_f == 0].sample(n=20)

Unnamed: 0,Mag_id,Core_id,Mag_Authors,Mag_Title,Core_Authors,Core_Title,Len_mag,Len_core,Known_l,Known_f,Known_l_f,Known_l+f
2223095,2140293347,81061579,"[{'name': 'Michael P. W. Grocott', 'id': '1171...",Everest 60 years on: what next?,[],Everest 60 years on: what next?,3,0,0,0,0,0
2905384,2472323562,81806797,"[{'name': 'Dona Fleishaker', 'id': '2316449986...",Safety and pharmacodynamic dose response of sh...,[],Safety and pharmacodynamic dose response of sh...,5,0,0,0,0,0
3036791,2549714162,81271064,"[{'name': 'Jiang Long', 'id': '2594970808'}, {...",Prevalence and correlates of problematic smart...,[],Prevalence and correlates of problematic smart...,10,0,0,0,0,0
618489,1966610449,81819726,"[{'name': 'Ye Ding', 'id': '2766019190'}, {'na...",Fish genome manipulation and directional breeding,[],Fish genome manipulation and directional breeding,3,0,0,0,0,0
1190458,2025099632,81195933,"[{'name': 'Berno Bucker', 'id': '2037640033'},...",The effect of reward on orienting and reorient...,[],The effect of reward on orienting and reorient...,2,0,0,0,0,0
2077065,2121494685,81701871,"[{'name': 'Philipp M. Keune', 'id': '243249279...",Dynamic walking features and improved walking ...,[],Dynamic walking features and improved walking ...,8,0,0,0,0,0
3121647,2589066055,81276086,"[{'name': 'Jeffrey J. Kim', 'id': '2543414348'...",Universal electronic-cigarette test: physioche...,[],Universal electronic-cigarette test: physioche...,9,0,0,0,0,0
2283718,2148113627,81804806,"[{'name': 'N. Giordano', 'id': '2151326331'}, ...",Physical modeling of the piano,[],Physical Modeling of the Piano,2,0,0,0,0,0
1180063,2024027755,81745018,"[{'name': 'Chien-Chung Kuo', 'id': '2420987193...",Ankle morphometry in the Chinese population,[],Ankle morphometry in the Chinese population,7,0,0,0,0,0
2803267,2341141931,81792727,"[{'name': 'Vibeke Børsholt Rudkjøbing', 'id': ...",Comparing culture and molecular methods for th...,[],Comparing culture and molecular methods for th...,12,0,0,0,0,0
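
Every row in this sample has an empty CORE author list, so a zero count here says nothing about the quality of the match. A hedged sketch of how the two situations could be told apart follows. It works on three rows copied from the tables in this notebook as plain dictionaries rather than on `auth_df` itself, and the variable names are illustrative.

```python
# Index, Len_core and Known_l_f values copied from rows shown in this notebook.
rows = [
    {'index': 2223095, 'Len_core': 0, 'Known_l_f': 0},  # CORE author list is empty
    {'index': 1239,    'Len_core': 1, 'Known_l_f': 0},  # CORE lists one author, no hit
    {'index': 3298522, 'Len_core': 1, 'Known_l_f': 0},  # CORE lists only a collaboration
]

# Zero hits because there was nothing to compare against ...
no_core_authors = [r['index'] for r in rows if r['Known_l_f'] == 0 and r['Len_core'] == 0]
# ... versus zero hits although CORE did list authors (worth a manual look).
unmatched_authors = [r['index'] for r in rows if r['Known_l_f'] == 0 and r['Len_core'] > 0]

print(no_core_authors, unmatched_authors)
```

On the dataframe, the second group corresponds to the filter `auth_df[(auth_df.Known_l_f == 0) & (auth_df.Len_core > 0)]`.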


In [55]:
auth_df[(auth_df.Known_l_f < auth_df.Len_mag) & (auth_df.Known_l_f < auth_df.Len_core)].sample(20)

Unnamed: 0,Mag_id,Core_id,Mag_Authors,Mag_Title,Core_Authors,Core_Title,Len_mag,Len_core,Known_l,Known_f,Known_l_f,Known_l+f
844264,1989671159,44119603,"[{'name': 'Vincent Revéret', 'id': '2798873930...",A study on the use of the PACS bolometer array...,"[Reveret, V, Rodriguez, L R, André, P, Horeau,...",A study on the use of the PACS bolometer array...,7,6,5,0,5,0
3298522,2763674566,2448810,"[{'name': 'J. Breitweg', 'id': '721961190'}]",Measurement of jet shapes in high,[ZEUS Collaboration],Measurement of Jet Shapes in High-Q**2 Deep In...,1,1,0,0,0,0
618083,1966568822,29221715,"[{'name': 'Romana Novaković', 'id': '185199666...",Micronutrient intake and status in Central and...,"[Novakovic, R.N., Cavelaars, A.J.E.M., Bekkeri...",Micronutrient intake and status in Central and...,15,14,11,0,11,0
1242409,2030437588,11288286,"[{'name': 'Claire Booth', 'id': '2120746571'},...",X-linked lymphoproliferative disease due to SA...,"[Booth, C, Gilmour, K C, Veys, P, Gennery, A R...",X-linked lymphoproliferative disease due to SA...,48,48,46,1,47,0
1600344,2067190946,2366490,"[{'name': 'Marián Boguñá', 'id': '111975288'},...",Cut-offs and finite size effects in scale-free...,"[Boguna, Marian, Pastor-Satorras, Romualdo, Ve...",Cut-offs and finite size effects in scale-free...,3,3,2,2,2,2
3380736,37835326,148657941,[{'name': 'José Antonio Calvo-Manzano Villalón...,CONEVTO: Contract Evaluation tool for Software...,"[Calvo-Manzano Villalon, Jose Antonio, Cuevas ...",CONEVTO: Contract Evaluation tool for Software...,5,5,3,3,4,2
2145401,2130290299,79095399,"[{'name': 'A. M. Bach', 'id': '2144667830'}]",Search for pair production of a new b′ quark t...,"[Aad, G., ATLAS Collaboration]",Search for pair production of a new b′ quark t...,1,2,0,0,0,0
3556169,95620398,9417703,"[{'name': 'Andrew Ramsay', 'id': '2601243374'}...","Sputum, sex and scanty smears: new case defini...","[Ramsay, A, Bonnet, M, Gagnidze, L, Githui, W,...","Sputum, sex and scanty smears: new case defini...",6,6,5,0,5,0
1097697,2015561918,13453422,"[{'name': 'Adrian Covaci', 'id': '1814458907'}...",Determination of organohalogenated contaminant...,"[Covaci, Adrian, Van de Vijver, Kristin Inneke...",Determination of organohalogenated contaminant...,7,7,6,4,6,4
1425106,2049188768,20119223,"[{'name': 'Rajat Acharyya', 'id': '2163203943'...",Income based price subsidies and parallel imports,"[Acharyya, Rajat, Garcia-Alonso, Maria D C]",Income based price subsidies and parallel imports,2,2,1,1,1,1


In [56]:
with open(mag_path) as f:
    for i,line in enumerate(f):
        j = loads(line)
        if j['id'] == '2030437588':
            print(j)
            break
print(datetime.now())

{'id': '2030437588', 'title': 'X-linked lymphoproliferative disease due to SAP/SH2D1A deficiency: a multicenter study on the manifestations, management and outcome of the disease', 'authors': [{'name': 'Claire Booth', 'id': '2120746571'}, {'name': 'Kimberly Gilmour', 'id': '2892895508'}, {'name': 'Paul Veys', 'id': '2145977222'}, {'name': 'Andrew R. Gennery', 'id': '84067822'}, {'name': 'Mary Slatter', 'id': '740779825'}, {'name': 'Helen Chapel', 'id': '344128693'}, {'name': 'Paul T. Heath', 'id': '2097140458'}, {'name': 'Colin G. Steward', 'id': '2252320061'}, {'name': 'Owen P. Smith', 'id': '2130009484'}, {'name': "Anna O'Meara", 'id': '2618931660'}, {'name': 'Hilary Kerrigan', 'id': '2564349294'}, {'name': 'Nizar Mahlaoui', 'id': '314198558'}, {'name': 'Marina Cavazzana-Calvo', 'id': '2030339593'}, {'name': 'Alain Fischer', 'id': '2238277596'}, {'name': 'Despina Moshous', 'id': '51054839'}, {'name': 'Stéphane Blanche', 'id': '2224366835'}, {'name': 'Jana Pachlopnick-Schmid', 'id': '

We load the data generated in earlier steps of this notebook. Now we create another column which contains the number of MAG authors for whom a Levenshtein comparison finds a sufficiently similar name in the CORE author list.

In [4]:
import pandas as pd

auth_df = pd.read_json('results/mag_match/matching_checkup.json')

In [30]:
auth_df['Known_lev'] = [compare_lev(mag,core) for mag, core in zip(auth_df['Mag_Authors'],auth_df['Core_Authors'])]

In [32]:
auth_df

Unnamed: 0,Mag_id,Core_id,Mag_Authors,Mag_Title,Core_Authors,Core_Title,Len_mag,Len_core,Known_l,Known_f,Known_l_f,Known_l+f,Known_lev
0,1000000185,42023535,"[{'name': 'Sanpei Usui', 'id': '2398577680'}]",Complex structures on partial compactification...,"[Usui, Sanpei]",Complex structures on partial compactification...,1,1,1,1,1,1,1
1,100000407,82235102,"[{'name': 'Surender Deora', 'id': '2132854008'...",Cardiac Epithelioid Leiomyosarcoma as Both Int...,"[Deora, Surender, Gurmukhani, Sunil, Shah, San...",Cardiac Epithelioid Leiomyosarcoma as Both Int...,6,6,8,6,8,6,6
2,1000007777,9554974,"[{'name': 'Shengye Zeng', 'id': '2163875822'},...",The investigation of material removal in bonne...,"[Zeng, Shengye, Blunt, Liam, Jiang, Xiang]",The investigation of material removal in bonne...,3,3,3,3,3,3,3
3,1000013286,36706632,"[{'name': 'Clifford Allison. Rose', 'id': '251...",Time multiplexing of compensation in control s...,"[Rose, Clifford Allison.]",Time multiplexing of compensation in control s...,1,1,1,1,1,1,1
4,1000014399,6112882,"[{'name': 'Nicky Stanley', 'id': '2171225193'}...",Disclosing Disability: Disabled students and p...,"[Stanley, Nicky, Ridley, Julie, Harris, Jessic...",Disclosing Disability: Disabled students and p...,5,5,5,5,5,5,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3567226,999972508,82484073,"[{'name': 'Rebecca A. Fischer', 'id': '2164566...",High pressure metal–silicate partitioning of N...,"[Fischer, Rebecca A., Nakajima, Yoichi, Campbe...",High pressure metal–silicate partitioning of N...,9,9,9,9,9,9,9
3567227,99997835,82317676,"[{'name': 'Alice K. Jacobs', 'id': '2807342118'}]","Women, ischemic heart disease, revascularizati...","[Jacobs, Alice K.]","Women, Ischemic Heart Disease, Revascularizati...",1,1,1,1,1,1,1
3567228,999981119,18590244,"[{'name': 'Zhe Kan', 'id': '2649816162'}]",Electrical properties of carbon structures : c...,"[Kan, Zhe]",Electrical properties of carbon structures : c...,1,1,1,1,1,1,1
3567229,999981997,31012454,"[{'name': 'Georg Futter', 'id': '2124207548'},...",Modeling of transport and degradation processe...,"[Futter, Georg, Jahnke, Thomas, Latz, Arnulf]",Modeling of transport and degradation processe...,3,3,3,3,3,3,3


In [33]:
auth_df[auth_df.Known_lev == 0]

Unnamed: 0,Mag_id,Core_id,Mag_Authors,Mag_Title,Core_Authors,Core_Title,Len_mag,Len_core,Known_l,Known_f,Known_l_f,Known_l+f,Known_lev
53,1000214280,82394347,"[{'name': 'Virtue Registry Investigators', 'id...",Mid-term outcomes and aortic remodelling after...,[],Mid-term Outcomes and Aortic Remodelling After...,1,0,0,0,0,0,0
57,1000229741,81870759,"[{'name': 'Monika Riegel', 'id': '1968574148'}...",Characterization of the Nencki Affective Pictu...,[],Characterization of the Nencki Affective Pictu...,13,0,0,0,0,0,0
152,1000581368,82342819,"[{'name': 'İsmail Bircan', 'id': '2314037064'}]",Analysis of Innovation-Based Human Resources f...,"[Bircan, İsmail, Gençler, Funda]",Analysis of Innovation-Based Human Resources f...,1,2,1,1,1,1,0
187,100073140,143779951,"[{'name': 'Suzanne Elanore Turner Murray', 'id...",A comparison of some computational techniques ...,"[Murray, Suzanne Elanore Turner, 1940-, Murray...",A comparison of some computational techniques ...,1,2,2,2,2,2,0
346,100140187,15158514,"[{'name': 'W. G. H. Maxwell', 'id': '213743812...",Upper Palaeozoic formations in the Mt. Morgan ...,"[Maxwell, W. G. H. (William Graham Henderson)]",Upper Palaeozoic formations in the Mt. Morgan ...,1,1,1,1,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3566783,998266863,81891020,"[{'name': 'Christian Ritzer', 'id': '241427995...",Measurement of the point spread function of a ...,[],Measurement of the point spread function of a ...,4,0,0,0,0,0,0
3566873,998676226,81529672,"[{'name': 'Mpho Keetile', 'id': '87376778'}, {...",Patterns and determinants of hypertension in B...,[],Patterns and determinants of hypertension in B...,3,0,0,0,0,0,0
3566943,99898334,81734742,"[{'name': 'Paula Tuma', 'id': '2293251928'}, {...",Outcome of chronic hepatitis delta in patients...,[],Outcome of chronic hepatitis delta in patients...,9,0,0,0,0,0,0
3566991,99917349,81067737,"[{'name': 'Zhenlong Liu', 'id': '2105551457'},...",The interferon-induced MxB protein inhibits an...,[],The interferon-induced MxB protein inhibits an...,8,0,0,0,0,0,0


In [34]:
auth_df[(auth_df.Known_lev < auth_df.Len_mag) & (auth_df.Known_lev < auth_df.Len_core)]

Unnamed: 0,Mag_id,Core_id,Mag_Authors,Mag_Title,Core_Authors,Core_Title,Len_mag,Len_core,Known_l,Known_f,Known_l_f,Known_l+f,Known_lev
152,1000581368,82342819,"[{'name': 'İsmail Bircan', 'id': '2314037064'}]",Analysis of Innovation-Based Human Resources f...,"[Bircan, İsmail, Gençler, Funda]",Analysis of Innovation-Based Human Resources f...,1,2,1,1,1,1,0
187,100073140,143779951,"[{'name': 'Suzanne Elanore Turner Murray', 'id...",A comparison of some computational techniques ...,"[Murray, Suzanne Elanore Turner, 1940-, Murray...",A comparison of some computational techniques ...,1,2,2,2,2,2,0
199,100079032,12356592,"[{'name': 'O. O. Ajayi', 'id': '2470271829'}, ...",Assessment of Wind Power Potential and Wind El...,"[Ajayi, O. O., Fagbenle, R. O., Katende, J.]",Assessment of Wind Power Potential and Wind El...,3,3,2,3,3,2,2
255,100103939,16388691,"[{'name': 'Ryota Nakajima', 'id': '2099235777'...",Sedimentation impacts on the growth rates of t...,"[R., Nakajima, T., Yoshida, Y., Fuchinove, T.,...",Sedimentation impacts on the growth rates of t...,8,8,7,9,13,3,7
305,100120857,20342195,"[{'name': 'Marta Luna Serrano', 'id': '2104771...",Dysfunctional 3D model based on structural and...,"[Luna Serrano, Marta, Caballero Hernandez, Rut...",Dysfunctional 3D model based on structural and...,10,10,9,12,12,9,9
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3567016,999271521,33667039,"[{'name': 'Saskia Brix', 'id': '2048068937'}, ...",IceAGE - Icelandic marine Animals: Genetics an...,"[Brix, Saskia, Martinez, Pedro, Svavarsson, Jö...",IceAGE - Icelandic marine Animals: Genetics an...,11,11,10,11,11,10,10
3567096,999549861,82522043,"[{'name': 'Andrea Sujová', 'id': '2310056901'}...",Modern Methods of Process Management Used in S...,"[Sujova, Andrea, Marcinekova, Katarina]",Modern Methods of Process Management Used in S...,2,2,0,1,1,0,1
3567148,999726932,36088314,"[{'name': 'Irene Gonzalez-Valls', 'id': '20322...",Aligned semiconductor oxide nanostructures for...,"[González-Valls, Irene, González-García, Lola,...",Aligned semiconductor oxide nanostructures for...,8,8,6,6,7,5,7
3567195,999883381,36047224,"[{'name': 'Pelayo González', 'id': '2175088314...",Functional polymorphism in the promoter region...,"[González, P., Díez-Juan, Antonio, Andrés, Vic...",Functional polymorphism in the promoter region...,5,5,4,4,5,3,4
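
The filter in the cell above keeps a row only when Known_lev is smaller than both list lengths. The following sketch applies the same mask to a toy frame whose Len_mag, Len_core and Known_lev values are copied from rows 0, 53, 152 and 199 of the output above; the names `toy`, `partial_ids` and `empty_core` are made up for the example. It suggests why entries with an empty CORE author list, which appear in the `Known_lev == 0` view, are absent here: for them Known_lev is 0 and Len_core is 0, and 0 < 0 is false.

```python
import pandas as pd

# Toy rows: values copied from the notebook output, indexed by the original row label.
toy = pd.DataFrame(
    {
        "Len_mag": [1, 1, 1, 3],
        "Len_core": [1, 0, 2, 3],
        "Known_lev": [1, 0, 0, 2],
    },
    index=[0, 53, 152, 199],
)

# Same mask as the cell above: the match count is below both list lengths.
partial = toy[(toy.Known_lev < toy.Len_mag) & (toy.Known_lev < toy.Len_core)]
partial_ids = list(partial.index)

# Rows with an empty CORE author list never satisfy Known_lev < Len_core.
empty_core = list(toy[toy.Len_core == 0].index)

print(partial_ids, empty_core)
```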


In [6]:
auth_df[(auth_df.Known_lev < auth_df.Len_mag) & (auth_df.Known_lev < auth_df.Len_core)].sample(20)

Unnamed: 0,Mag_id,Core_id,Mag_Authors,Mag_Title,Core_Authors,Core_Title,Len_mag,Len_core,Known_l,Known_f,Known_l_f,Known_l+f,Known_lev
3078091,2566879760,93347338,"[{'name': 'Sebastian Dullien', 'id': '20374438...",The IMF to the rescue: Did the euro area benef...,"[Dullien, Sebastian, Fritz, Barbara, M\ufchlic...",The IMF to the rescue: Did the euro area benef...,3,3,2,3,3,2,2
395629,2578560219,16293427,"[{'name': 'Piotr S. Żuchowski', 'id': '2590333...",Low-energy collisions ofNH3andND3with ultracol...,"[Zuchowski, Piotr S., Hutson, Jeremy M.]",Low-energy collisions of NH3 and ND3 with ultr...,2,2,1,2,2,1,1
1001203,2005719012,13736463,"[{'name': 'Anders Falk Vikbjerg', 'id': '20252...",Synthesis of structured phospholipids by immob...,"[Vikbjerg, Anders Falk, Vikbjerg, Anders Falk,...",Synthesis of structured phospholipids by immob...,3,3,3,3,3,3,2
1072011,2012946639,62456001,"[{'name': 'Acilina Caneco', 'id': '289769456'}...",Kneading theory analysis of the Duffing equation,"[Caneco, Acilina, Rocha, Jose, Gracio, Clara]",Kneading theory analysis of the Duffing equation,3,3,2,2,3,1,2
2974136,2519659746,74658331,"[{'name': 'Thomas De Schryver', 'id': '2194005...",In-line NDT with X-Ray CT combining sample rot...,"[De Schryver, Thomas, Dhaene, Jelle, Dierick, ...",In-line NDT with X-Ray CT combining sample rot...,10,10,9,11,11,9,9
1133593,2019229259,50691874,"[{'name': 'Henrik Hein Lauridsen', 'id': '2058...",Development of the young spine questionnaire,"[Lauridsen, Henrik Hein, Hestbæk, Lise]",Development of the young spine questionnaire,2,2,1,2,2,1,1
2179515,2749636074,85125898,"[{'name': 'Suraj Peri', 'id': '2016816938'}, {...",Development of human protein reference databas...,"[Peri, Suraj, Navarro, J.Daniel, Amanchy, Rama...",Development of human protein reference databas...,50,50,52,65,69,48,49
572470,1942594925,81144151,"[{'name': 'William W. Hope', 'id': '2096597877...",EUCAST Technical Note on Voriconazole and Aspe...,"[Hope, W.W., Cuenca-Estrella, M., Lass-Florl, ...",EUCAST Technical Note on Voriconazole and Aspe...,4,4,3,0,3,0,3
943928,1999845618,13718277,"[{'name': 'C. B. Collins', 'id': '2155199751'}...","Microstructural analyses of amorphic diamond, ...","[Collins, C. B., Davanloo, F., Jander, D.R., L...","Microstructural analyses of amorphic diamond, ...",9,9,8,8,10,6,8
800717,1985252301,13733564,"[{'name': 'Mads Brandbyge', 'id': '2462294175'...",Density-functional method for nonequilibrium e...,"[Brandbyge, Mads, Mozos, J.L., Ordejon, P., Ta...",Density-functional method for nonequilibrium e...,5,5,4,3,4,3,4


In [35]:
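def authors_full(row_id):
    # Print the untruncated MAG and CORE author lists for one row of auth_df.
    # row_id is the DataFrame index label shown in the leftmost column.
    print(auth_df.loc[row_id].Mag_Authors)
    print('\n')
    print(auth_df.loc[row_id].Core_Authors)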

In [36]:
authors_full(3567196)

[{'name': 'Pranas Baltrenas', 'id': '2267442476'}, {'name': 'Antonas Misevičius', 'id': '2506408229'}, {'name': 'Kęstutis Mačaitis', 'id': '2313707204'}, {'name': 'Ruta Tekoriene', 'id': '2607635964'}]


['Baltrenas, Pranas', 'Misevicius, Antonas', 'Macaitis, Kestutis', 'Tekoriene, Ruta']


In [37]:
authors_full(3567148)

[{'name': 'Irene Gonzalez-Valls', 'id': '2032253405'}, {'name': 'Lola González-García', 'id': '2023463444'}, {'name': 'Belén Ballesteros', 'id': '2106546204'}, {'name': 'F. Güell', 'id': '2343592621'}, {'name': 'Angel Barranco', 'id': '2133005086'}, {'name': 'Youhai Yu', 'id': '2105708633'}, {'name': 'Agustín R. González-Elipe', 'id': '1956032063'}, {'name': 'Monica Lira-Cantu', 'id': '2006387408'}]


['González-Valls, Irene', 'González-García, Lola', 'Ballesteros, Belén', 'Güell, F.', 'Barranco, Ángel', 'Yu, Youhai', 'González-Elipe, Agustín R.', 'Lira-Cantú, Mónica']
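
In both examples above the two author lists name the same people and differ only in diacritics and name order. One way to make such pairs compare equal, shown here as a sketch rather than as something this notebook does, is to strip combining marks with the standard-library `unicodedata` module before comparing. The helper names `strip_diacritics` and `name_key` are hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after NFKD decomposition (e.g. 'č' becomes 'c').

    Letters without a decomposition, such as 'æ' or 'ø', are left unchanged.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_key(name: str) -> str:
    """Diacritic-free, lowercased 'first last' key for either name format."""
    last, comma, first = name.partition(",")
    ordered = f"{first.strip()} {last.strip()}" if comma else name
    return strip_diacritics(ordered).lower().strip()


# Names taken from the two authors_full outputs above.
pairs = [
    ("Kęstutis Mačaitis", "Macaitis, Kestutis"),
    ("Angel Barranco", "Barranco, Ángel"),
    ("Monica Lira-Cantu", "Lira-Cantú, Mónica"),
]
for mag_name, core_name in pairs:
    print(name_key(mag_name) == name_key(core_name))
```

Comparing on such a key would treat accent-only differences as exact matches, so an edit-distance tolerance would only be needed for real spelling variation such as abbreviated first names.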


### After comparing the author lists of matched articles using a variety of measures, differences between the lists were found for around 158K of the 3.5M entries. No clear mismatch was found.