# Repeated Offenders in MAG

In this notebook, we take the repeated offenders identified in Retraction Watch (RW) and locate the same authors in MAG.

The procedure is as follows:

1. Read the repeated offenders identified in RW.
2. Read the author records extracted from MAG.
3. Collect the RW record IDs for each repeated offender.
4. For each offender, compare the name against the MAG author names attached to the same record IDs, using exact and fuzzy matching.
5. If the best match has a fuzzy score of at least 90, flag that MAG author as a repeated offender.
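
The steps above can be sketched on toy data. This is a minimal illustration, not the notebook's implementation: it uses the standard library's `difflib` as a stand-in for rapidfuzz so that it runs without extra dependencies, so its scores will not always agree with rapidfuzz's. The helper `similarity`, the dictionary `mag_authors_by_record`, and all record IDs are hypothetical; only the name pair `a n al-isa` / `a n alisa` mirrors a match that appears in the notebook output.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (difflib stand-in, not rapidfuzz's scorer)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


# Toy inputs: record IDs and most names are made up for illustration
offender_to_records = {"a n al-isa": [1, 2], "j doe": [3, 4]}
mag_authors_by_record = {
    1: ["a n alisa", "b smith"],
    2: ["a n alisa"],
    3: ["k lee"],
    4: ["m chan"],
}

CUTOFF = 90.0
rw_to_mag = {}

for offender, records in offender_to_records.items():
    # Candidates are restricted to MAG authors on this offender's own records
    candidates = {
        name for rec in records for name in mag_authors_by_record.get(rec, [])
    }
    if not candidates:
        continue
    # Keep the single best-scoring candidate, and only if it clears the cutoff
    best_score, best_name = max((similarity(offender, c), c) for c in candidates)
    if best_score >= CUTOFF:
        rw_to_mag[offender] = best_name

print(rw_to_mag)
```

Restricting candidates to authors that share a record ID keeps the comparison small and avoids matching an offender to an unrelated MAG author who merely has a similar name.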

In [75]:
# Importing relevant packages

import pandas as pd
import os
from config_reader import read_config
from rapidfuzz import process, fuzz

In [76]:
# Reading paths
paths = read_config()
PROCESSED_REPEATED_OFFENDERS = paths['PROCESSED_REPEATED_OFFENDERS']
PROCESSED_RETRACTED_AUTHOR_HISTORIES_LOCAL_PATH = paths['PROCESSED_RETRACTED_AUTHOR_HISTORIES_LOCAL_PATH']
OUTDIR = paths['PROCESSED_FOLDER_LOCAL']

In [77]:
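# Step 1: read the RW repeated offenders, add a lowercased name column,
# and keep only offenders whose first retraction was in 2015 or earlier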
df_repeated_offenders = pd.read_csv(PROCESSED_REPEATED_OFFENDERS)
df_repeated_offenders['AuthorNameNorm'] = df_repeated_offenders['AuthorName'].str.lower()
df_repeated_offenders = df_repeated_offenders[df_repeated_offenders['FirstRetractionYear'].le(2015)]
df_repeated_offenders.head()

Unnamed: 0,Record ID,AuthorName,nRetracted,FirstRetractionYear,OffenseSameYear,AuthorNameNorm
3,17742,A Abdel Motelib,2,2015,True,a abdel motelib
4,17766,A Abdel Motelib,2,2015,True,a abdel motelib
5,4541,A Abou-Elela,2,2009,True,a abou-elela
6,4630,A Abou-Elela,2,2009,True,a abou-elela
7,4036,A Anthony,2,2010,True,a anthony
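
`AuthorNameNorm` only lowercases the RW name. Several of the non-exact matches printed later in this notebook differ from the RW name only by a hyphen (for example `a n al-isa` against `a n alisa`), which suggests that a slightly stronger normalization could turn some fuzzy matches into exact ones. The sketch below is one possible way to do that; `normalize_name` is a hypothetical helper, the sample names are made up, and MAG names do not appear to follow a single convention (`yi-he ling` is matched to `yi he ling`, with a space), so this would complement fuzzy matching rather than replace it.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip diacritics, drop hyphens, and collapse whitespace."""
    # Decompose accented characters into base character + combining mark
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping the base characters
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove hyphens so that "al-isa" becomes "alisa"
    no_hyphens = no_marks.replace("-", "")
    # Collapse runs of whitespace and lowercase
    return re.sub(r"\s+", " ", no_hyphens).strip().lower()


print(normalize_name("A N Al-Isa"))
print(normalize_name("  José  Núñez-Pérez "))
```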


In [78]:
df_repeated_offenders['AuthorName'].nunique()

3921

In [79]:
# Build a dictionary mapping each normalized offender name to its record IDs:
# group by AuthorNameNorm and aggregate Record ID into a list
offender_to_records = df_repeated_offenders.groupby('AuthorNameNorm')['Record ID'].apply(list).to_dict()

offender_to_records

{'a a zafar': [16742, 8462],
 'a abdel motelib': [17742, 17766],
 'a abou-elela': [4541, 4630],
 'a anthony': [4036, 1050],
 'a antony joseph': [17262, 2877],
 'a c jesudoss prabhakaran': [4136, 4294],
 'a d nkengafac': [7869, 7871],
 'a gattoni': [3383, 3382],
 'a giolis': [2865, 2510],
 'a harkavyi': [1543, 2901],
 'a iordache': [4526, 4527],
 'a jake demetris': [3462, 6605],
 'a k gupta': [19185, 7764, 820],
 'a kumar': [3537, 3742],
 'a m k el-ghonemy': [5267, 5152, 18225, 18192, 18208],
 'a n al-isa': [1792, 20842],
 'a parlato': [3383, 3382],
 'a r m yusoff': [17167, 17168],
 'a rajendran': [17012, 17013],
 'a simon': [6367, 4471],
 'a venkata rao': [23075, 23073],
 'a w gardner': [1346, 3003],
 'aadithya b urs': [4607, 3584, 3583],
 'aaron s dumont': [21268, 2871],
 'abdeladhim ben abdeladhim': [1329, 325, 1330],
 'abdelilah chaoui': [2423, 4703],
 'abderrahman abdelkefi': [1329, 325, 1330],
 'abdolmajid bayandori moghaddam': [5476, 5477, 16727, 16728, 16584],
 'abdullah agit': 

In [13]:
# Step 2: read the MAG author histories
df_authors_mag = pd.read_csv(PROCESSED_RETRACTED_AUTHOR_HISTORIES_LOCAL_PATH)
df_authors_mag.head()

Unnamed: 0,MAGPID,MAGAID,MAGAffID,MAGAuthorOrder,MAGTitle,MAGPubYear,MAGAuthorName,Record ID,RetractedPaperMAGPID,RetractionYear
0,524,569306176,148283060.0,1,the cryo thermochromatographic separator cts a...,2002.0,u w kirbach,23311,2273790572,2002.0
1,1598005161,569306176,148283060.0,8,non observation of the production of superheav...,2002.0,u w kirbach,23311,2273790572,2002.0
2,1968637420,569306176,,5,publisher s note confirmation of production of...,2003.0,u w kirbach,23311,2273790572,2002.0
3,1989365649,569306176,148283060.0,14,chemical investigation of hassium element 108,2002.0,u w kirbach,23311,2273790572,2002.0
4,2006519711,569306176,148283060.0,16,chemical and nuclear studies of hassium and el...,2004.0,u w kirbach,23311,2273790572,2002.0


In [82]:
# Keep only the rows of df_authors_mag whose Record ID belongs to a repeated offender;
# author names are extracted from these rows in the matching step
df_relevant_authors_mag = df_authors_mag[df_authors_mag['Record ID'].isin(df_repeated_offenders['Record ID'])]

df_relevant_authors_mag.head()

Unnamed: 0,MAGPID,MAGAID,MAGAffID,MAGAuthorOrder,MAGTitle,MAGPubYear,MAGAuthorName,Record ID,RetractedPaperMAGPID,RetractionYear
1046,14792,2668341939,,9,the effect of mosapride on quality of life in ...,2004.0,sang young seol,1493,2111438153,2015.0
1047,38332783,2668341939,,3,efficacy and safety of albis r in acute and ch...,2011.0,sang young seol,1493,2111438153,2015.0
1048,72008540,2668341939,,8,biliary duodenal fistula following radiofreque...,2008.0,sang young seol,1493,2111438153,2015.0
1049,934525158,2668341939,,5,tu1672 evaluation of surface microvascular and...,2012.0,sang young seol,1493,2111438153,2015.0
1050,1919447194,2668341939,104338594.0,7,randomized controlled multi center trial compa...,2015.0,sang young seol,1493,2111438153,2015.0


In [91]:
# For each RW repeated offender, look for the best-matching name among the
# MAG authors attached to that offender's records.
# If a match clears the cutoff, we record the MAG name so that it can be flagged.
# The flagging itself is done later in df_authors_mag.

rw_to_mag_mapping = {}

for repeated_offender, relevant_records in offender_to_records.items():

    # Unique MAG author names that appear on this offender's records
    on_offender_records = df_relevant_authors_mag['Record ID'].isin(relevant_records)
    relevant_authors_mag = df_relevant_authors_mag.loc[on_offender_records, 'MAGAuthorName'].unique()

    # Keep only the single best candidate (limit=1) with a score of at least 90
    choices_in_mag = process.extract(repeated_offender, relevant_authors_mag,
                                     limit=1, score_cutoff=90)

    if choices_in_mag:
        print(repeated_offender, choices_in_mag)
        rw_to_mag_mapping[repeated_offender] = choices_in_mag[0][0]


a abdel motelib [('a abdel motelib', 100.0, 1)]
a anthony [('a anthony', 100.0, 1)]
a antony joseph [('a antony joseph', 100.0, 1)]
a c jesudoss prabhakaran [('ac jesudoss prabhakaran', 97.87234042553192, 0)]
a gattoni [('a gattoni', 100.0, 1)]
a giolis [('a giolis', 100.0, 9)]
a harkavyi [('a harkavyi', 100.0, 6)]
a iordache [('a iordache', 100.0, 6)]
a jake demetris [('a jake demetris', 100.0, 7)]
a k gupta [('a k gupta', 100.0, 8)]
a m k el-ghonemy [('a m k elghonemy', 96.7741935483871, 0)]
a n al-isa [('a n alisa', 94.73684210526316, 0)]
a parlato [('a parlato', 100.0, 3)]
a r m yusoff [('a r m yusoff', 100.0, 2)]
a rajendran [('a rajendran', 100.0, 3)]
a simon [('a simon', 100.0, 4)]
a venkata rao [('a venkata rao', 100.0, 5)]
aadithya b urs [('aadithya b urs', 100.0, 2)]
aaron s dumont [('aaron s dumont', 100.0, 1)]
abdeladhim ben abdeladhim [('abdeladhim ben abdeladhim', 100.0, 4)]
abdelilah chaoui [('abdelilah chaoui', 100.0, 3)]
abderrahman abdelkefi [('abderrahman abdelkefi',

antonella trombetta [('antonella trombetta', 100.0, 4)]
antonello accurso [('antonello accurso', 100.0, 7)]
antonio gualberto [('antonio gualberto', 100.0, 18)]
antonio jimenez-velasco [('antonio jimenezvelasco', 97.77777777777777, 11)]
antonio m gotto jr [('antonio m gotto', 95.0, 0)]
antonio morelli [('antonio morelli', 100.0, 1)]
antonio mourino [('antonio mourino', 100.0, 0)]
antonio torres [('antoni torres', 96.2962962962963, 0)]
anup k ghosh [('anup k ghosh', 100.0, 3)]
anuradha aggarwal [('anuradha aggarwal', 100.0, 6)]
anwar a hakim [('a a hakim', 95.0, 0)]
anwar alam [('anwar alam', 100.0, 0)]
aoife ahern [('aoife ahern', 100.0, 0)]
apiruck watthanasurorot [('apiruck watthanasurorot', 100.0, 6)]
ari ristimaki [('ari ristimaki', 100.0, 11)]
ariane maris gomes [('ariane maris gomes', 100.0, 3)]
ariel anguiano [('ariel anguiano', 100.0, 9)]
arnab ghosh [('arnab ghosh', 100.0, 0)]
arnulf stenzl [('arnulf stenzl', 100.0, 2)]
arpad tosaki [('arpad tosaki', 100.0, 8)]
arrigo f g cice

carlos barreiro [('carlos barreiro', 100.0, 1)]
carlos t moraes [('carlos t moraes', 100.0, 0)]
carmela galli [('carmela galli', 100.0, 5)]
caroline e ford [('caroline e ford', 100.0, 0)]
caroline h s barwood [('caroline h s barwood', 100.0, 1)]
carsten boltze [('carsten boltze', 100.0, 5)]
carsten carlberg [('carsten carlberg', 100.0, 1)]
catherine m verfaillie [('catherine m verfaillie', 100.0, 0)]
cesare greco [('cesare greco', 100.0, 1)]
chad e grueter [('chad e grueter', 100.0, 4)]
chaitanya r acharya [('chaitanya r acharya', 100.0, 11)]
chanaka seneviratne [('chanaka seneviratne', 100.0, 3)]
chandana haldar [('chandana haldar', 100.0, 0)]
chandana mohanty [('chandana mohanty', 100.0, 1)]
chang han [('chang han', 100.0, 3)]
chang jin park [('chang jin park', 100.0, 3)]
chang liu [('chang liu', 100.0, 4)]
chang suk han [('chang suk han', 100.0, 0)]
chang zheng [('chang zheng', 100.0, 4)]
chang-fen huang [('changfen huang', 96.55172413793103, 1)]
chang-jin park [('changjin park', 96

david m ojcius [('david m ojcius', 100.0, 4)]
david n brindley [('david n brindley', 100.0, 0)]
david p siderovski [('david p siderovski', 100.0, 0)]
david paul hurlstone [('david p hurlstone', 91.89189189189189, 5)]
david s egilman [('david egilman', 95.0, 0)]
david s feldman [('david s feldman', 100.0, 5)]
david s hsu [('david s hsu', 100.0, 17)]
david s kirby [('david s kirby', 100.0, 6)]
david s latchman [('david s latchman', 100.0, 2)]
david serrano [('david serrano', 100.0, 4)]
davide romano [('davide romano', 100.0, 5)]
davood omrani [('d omrani', 90.0, 2)]
dawei gou [('dawei gou', 100.0, 3)]
debasish mishra [('debasish mishra', 100.0, 5)]
debasmita mandal [('debasmita mandal', 100.0, 1)]
debi p nayak [('debi p nayak', 100.0, 0)]
deborah a cory-slechta [('deborah a coryslechta', 97.67441860465115, 1)]
debra trampe [('debra trampe', 100.0, 4)]
deepak damodaran [('deepak damodaran', 100.0, 13)]
deepak garg [('deepak garg', 100.0, 0)]
deepak passi [('deepak passi', 100.0, 1)]
deepi

felix berger [('felix berger', 100.0, 0)]
fen liu [('fen liu', 100.0, 8)]
feng bai [('feng bai', 100.0, 7)]
feng gao [('feng gao', 100.0, 0)]
feng han [('feng han', 100.0, 2)]
feng hu [('feng hu', 100.0, 4)]
feng lan lou [('fenglan lou', 95.65217391304348, 4)]
feng li [('feng li', 100.0, 1)]
feng lin cao [('fenglin cao', 95.65217391304348, 3)]
feng liu [('feng liu', 100.0, 6)]
feng lu [('feng lu', 100.0, 1)]
feng tian [('feng tian', 100.0, 2)]
feng wang [('feng wang', 100.0, 0)]
feng zhang [('feng zhang', 100.0, 6)]
feng zhao [('feng zhao', 100.0, 3)]
fenghong huang [('fenghong huang', 100.0, 3)]
ferdinand frauscher [('ferdinand frauscher', 100.0, 1)]
ferdinando de vita [('ferdinando de vita', 100.0, 2)]
fereidoun mahboudi [('fereidoun mahboudi', 100.0, 2)]
ferenc reinhardt [('ferenc reinhardt', 100.0, 4)]
fernando m de souza [('fernando de souza', 95.0, 1)]
feroz khan [('feroz khan', 100.0, 0)]
florence jay [('florence jay', 100.0, 5)]
fousseyni s toure [('fousseyni s toure', 100.0, 2

hannah r vasanthi [('hannah r vasanthi', 100.0, 1)]
hannes strasser [('hannes strasser', 100.0, 4)]
hans krause [('hans krause', 100.0, 7)]
hans p op de beeck [('hans op de beeck', 95.0, 1)]
hans-henning eckstein [('hanshenning eckstein', 97.5609756097561, 0)]
hans-jochen decker [('hansjochen decker', 97.14285714285714, 1)]
hao wang [('hao wang', 100.0, 0)]
hao wu [('hao wu', 100.0, 8)]
hao xu [('hao xu', 100.0, 5)]
hao zhang [('zhang hao', 95.0, 2)]
hapipah mohd ali [('hapipah mohd ali', 100.0, 0)]
hari k koul [('hari k koul', 100.0, 0)]
harish kumar [('harish kumar', 100.0, 0)]
harish s hosalkar [('harish s hosalkar', 100.0, 0)]
harry b m van de wiel [('harry b m van de wiel', 100.0, 4)]
harsh pal bais [('harsh p bais', 92.3076923076923, 0)]
haruhiko takada [('haruhiko takada', 100.0, 2)]
haruko obokata [('haruko obokata', 100.0, 7)]
hassan h errihani [('hassan errihani', 95.0, 1)]
hassan sabzyan [('hassan sabzyan', 100.0, 1)]
hassan yousefnia [('hassan yousefnia', 100.0, 1)]
hassen 

inge czaja [('inge czaja', 100.0, 19)]
ingo eitel [('ingo eitel', 100.0, 0)]
ioannis a kyriazis [('ioannis kyriazis', 95.0, 6)]
ionel chirica [('ionel chirica', 100.0, 0)]
irfan rahman [('irfan rahman', 100.0, 1)]
irun r cohen [('irun r cohen', 100.0, 3)]
isa abdi-rad [('isa abdirad', 95.65217391304348, 4)]
isao miyakawa [('isao miyakawa', 100.0, 2)]
isao yamaguchi [('isao yamaguchi', 100.0, 0)]
ishwarlal jialal [('ishwarlal jialal', 100.0, 0)]
isidoro di carlo [('isidoro di carlo', 100.0, 0)]
isidre ferrer [('isidro ferrer', 92.3076923076923, 0)]
ismail dogan [('ismail dogan', 100.0, 3)]
ismail essadi [('ismail essadi', 100.0, 4)]
ismail yusoff [('ismail yusoff', 100.0, 1)]
issei komuro [('issei komuro', 100.0, 0)]
istvan lekli [('istvan lekli', 100.0, 8)]
italas george [('italas george', 100.0, 3)]
itamar raz [('itamar raz', 100.0, 1)]
itaru nitta [('itaru nitta', 100.0, 2)]
itzhak herz [('itzhak herz', 100.0, 3)]
ivan radovanovic [('ivan radovanovic', 100.0, 6)]
ivana de domenico [(

jing zhao [('jing zhao', 100.0, 5)]
jing zhou [('jing zhou', 100.0, 3)]
jingjun zhao [('jingjun zhao', 100.0, 0)]
jingru shao [('jingru shao', 100.0, 6)]
jingxia xu [('jingxia xu', 100.0, 2)]
jingyao qi [('jingyao qi', 100.0, 2)]
jinmai jiang [('jinmai jiang', 100.0, 3)]
jinming zhang [('jinming', 90.0, 3)]
jinseu park [('jinseu park', 100.0, 4)]
jinsong wu [('jinsong wu', 100.0, 0)]
jiping qi [('jiping qi', 100.0, 5)]
jirka peschek [('jirka peschek', 100.0, 0)]
jiro fujita [('jiro fujita', 100.0, 2)]
jitendriya mishra [('jitendriya mishra', 100.0, 0)]
jiyeon s kim [('jiyeon s kim', 100.0, 8)]
jnyanaranjan panda [('jnyanaranjan panda', 100.0, 2)]
jo h m berden [('jo h m berden', 100.0, 3)]
joab chapman [('joab chapman', 100.0, 1)]
joachim fandrey [('joachim fandrey', 100.0, 0)]
joana r feliciano [('joana r feliciano', 100.0, 3)]
joanna w y ho [('joanna w ho', 95.0, 7)]
jocelyn i dudley [('jocelyn i dudley', 100.0, 5)]
jochen gehrmann [('jochen gehrmann', 100.0, 2)]
joel a black [('joel

katsuya amano [('katsuya amano', 100.0, 3)]
katsuya maruyama [('katsuya maruyama', 100.0, 6)]
kaushik bhattacharya [('kaushik bhattacharya', 100.0, 2)]
kavita kirankumar patel [('kavita kirankumar patel', 100.0, 3)]
kazem barati [('kazem barati', 100.0, 7)]
kazuhiro hasezaki [('kazuhiro hasezaki', 100.0, 0)]
kazuhiro ito [('kazuhiro ito', 100.0, 6)]
kazuhiro tanaka [('kazuhiro tanaka', 100.0, 5)]
kazuhiro yoshida [('kazuhiro yoshida', 100.0, 1)]
kazuiku ohshiro [('kazuiku ohshiro', 100.0, 9)]
kazumi akimoto [('kazumi akimoto', 100.0, 3)]
kazunari taira [('kazunari taira', 100.0, 0)]
kazuya kondo [('kazuya kondo', 100.0, 1)]
kazuya shiogama [('kazuya shiogama', 100.0, 1)]
kazuyoshi yamaoka [('kazuyoshi yamaoka', 100.0, 4)]
ke wang [('ke wang', 100.0, 4)]
kei satoh [('kei satoh', 100.0, 1)]
keiji morokuma [('keiji morokuma', 100.0, 9)]
keiji ueno [('keiji ueno', 100.0, 1)]
keisuke amaha [('keisuke amaha', 100.0, 0)]
keith o hodgson [('keith o hodgson', 100.0, 5)]
keith p choe [('keith p 

libin wu [('libin wu', 100.0, 13)]
libo yao [('libo yao', 100.0, 1)]
lidija l radenovic [('lidija radenovic', 95.0, 0)]
lieping chen [('lieping chen', 100.0, 2)]
lihong xu [('lihong xu', 100.0, 0)]
lihua wang [('lihua wang', 100.0, 10)]
lijuan wang [('lijuan wang', 100.0, 4)]
lijun wang [('lijun wang', 100.0, 4)]
lili liu [('lili liu', 100.0, 9)]
lili wang [('lili wang', 100.0, 6)]
liliana haversen [('liliana haversen', 100.0, 6)]
limao wu [('limao wu', 100.0, 3)]
limin wang [('limin wang', 100.0, 5)]
lin chen [('li', 90.0, 1)]
lin cong [('lin cong', 100.0, 4)]
lin fang [('lin fang', 100.0, 3)]
lin hao [('lin hao', 100.0, 4)]
lin li [('lin li', 100.0, 1)]
lin liu [('lin liu', 100.0, 1)]
lin ma [('lin ma', 100.0, 0)]
lin song [('lin song', 100.0, 0)]
lin sun [('lin sun', 100.0, 2)]
lin wang [('lin wang', 100.0, 5)]
lin xun [('lin xun', 100.0, 2)]
lin zhang [('lin zhang', 100.0, 3)]
lina wang [('lina wang', 100.0, 5)]
linda b buck [('linda b buck', 100.0, 1)]
linda garland [('linda l gar

masahiro hayashi [('masahiro hayashi', 100.0, 0)]
masahiro nakayama [('masahiro nakayama', 100.0, 1)]
masakazu nishida [('masakazu nishida', 100.0, 4)]
masaki tanaka [('masaki tanaka', 100.0, 2)]
masaki watanabe [('masaki watanabe', 100.0, 7)]
masaki yoshida [('masaki yoshida', 100.0, 0)]
masanori naitou [('masanori naitou', 100.0, 11)]
masanori takaoka [('masanori takaoka', 100.0, 3)]
masashi imai [('masashi imai', 100.0, 0)]
masato masuda [('masato masuda', 100.0, 6)]
masatomo mihara [('masatomo mihara', 100.0, 27)]
masayo morita [('masayo morita', 100.0, 9)]
masayuki otani [('masayuki otani', 100.0, 3)]
masayuki tadano [('masayuki tadano', 100.0, 11)]
masayuki yamato [('masayuki yamato', 100.0, 7)]
masayuki yoshida [('masayuki yoshida', 100.0, 9)]
masayuki yoshioka [('masayuki yoshioka', 100.0, 7)]
masoud amiri [('masoud amiri', 100.0, 1)]
masoud hashemi [('masoud hashemi', 100.0, 0)]
masuko ushio-fukai [('masuko ushiofukai', 97.14285714285714, 0)]
mathew sharpe [('mathew sharpe', 1

mutsuko ito [('mutsuko ito', 100.0, 2)]
n m kamble [('n m kamble', 100.0, 4)]
n ramesh [('n ramesh', 100.0, 2)]
n rampersaud [('n rampersaud', 100.0, 3)]
na wang [('na wang', 100.0, 11)]
na wei [('na wei', 100.0, 1)]
na wu [('na wu', 100.0, 1)]
na zhang [('na zhang', 100.0, 2)]
nabil e el wakeil [('nabil elwakeil', 90.32258064516128, 0)]
nadav sorek [('nadav sorek', 100.0, 2)]
nader salama [('nader salama', 100.0, 3)]
nadine s sauter [('nadine s sauter', 100.0, 6)]
nagalingam r sundaresan [('nagalingam r sundaresan', 100.0, 0)]
naheed banu [('naheed banu', 100.0, 6)]
nahrizul adib kadri [('nahrizul adib kadri', 100.0, 0)]
naji nawfal [('naji nawfal', 100.0, 3)]
nam deuk kim [('nam deuk kim', 100.0, 0)]
namagiri sirishkumar [('namagiri sirishkumar', 100.0, 6)]
nan jiang [('nan jiang', 100.0, 4)]
nan li [('nan li', 100.0, 1)]
nan lu [('nan lu', 100.0, 3)]
nan shi [('nan shi', 100.0, 32)]
nancy a speck [('nancy a speck', 100.0, 3)]
nanette k wenger [('nanette k wenger', 100.0, 12)]
naoaki

philippe yaba [('philippe yaba', 100.0, 5)]
phillip g febbo [('phillip g febbo', 100.0, 13)]
phyllis y reaves [('phyllis y reaves', 100.0, 5)]
piercarlo sarzi-puttini [('piercarlo sarziputtini', 97.77777777777777, 1)]
pierluigi granone [('pierluigi granone', 100.0, 3)]
piero anversa [('piero anversa', 100.0, 0)]
piero del soldato [('piero del soldato', 100.0, 9)]
pietro lombardi [('pietro lombardi', 100.0, 0)]
pikul jiravanichpaisal [('pikul jiravanichpaisal', 100.0, 3)]
ping gao [('ping gao', 100.0, 5)]
ping li [('ping li', 100.0, 2)]
ping wang [('ping wang', 100.0, 1)]
ping yang [('ping yang', 100.0, 2)]
ping ye [('ping ye', 100.0, 3)]
ping zhang [('ping zhang', 100.0, 2)]
ping zhao [('ping zhao', 100.0, 4)]
ping zhou [('ping zhou', 100.0, 0)]
pingmei guo [('pingmei guo', 100.0, 6)]
pir abdul rasool qureshi [('pir abdul rasool qureshi', 100.0, 6)]
pongali b raghavendra [('pongali b raghavendra', 100.0, 3)]
pontus almer bostrom [('pontus bostrom', 95.0, 2)]
pooja sharma [('pooja sharm

rong li [('rong li', 100.0, 1)]
rong liu [('rong liu', 100.0, 3)]
rong xu [('rong xu', 100.0, 6)]
rong zhang [('rong zhang', 100.0, 9)]
rongxiu li [('rongxiu li', 100.0, 6)]
ronnie g p wismans [('ronnie g wismans', 95.0, 8)]
rony seger [('rony seger', 100.0, 5)]
rosa m quinta-ferreira [('rosa m quintaferreira', 97.67441860465115, 2)]
rosa visone [('rosa visone', 100.0, 11)]
rosalind romeo-meeuw [('rosalind romeomeeuw', 97.43589743589743, 9)]
rosana l sernaglia [('rosana l sernaglia', 100.0, 2)]
ross w harrington [('ross w harrington', 100.0, 1)]
rossen m donev [('rossen m donev', 100.0, 1)]
roya kelishadi [('roya kelishadi', 100.0, 0)]
rudiger ettrich [('rudiger ettrich', 100.0, 12)]
rudolf hohenfellner [('rudolf hohenfellner', 100.0, 0)]
rui cao [('rui cao', 100.0, 18)]
rui chen [('rui chen', 100.0, 2)]
rui curi [('rui curi', 100.0, 0)]
rui jiang [('rui jiang', 100.0, 0)]
rui peng [('rui peng', 100.0, 1)]
rui zhang [('rui zhang', 100.0, 1)]
rui-qi wang [('ruiqi wang', 95.2380952380952

shihhsin lu [('shihhsin lu', 100.0, 0)]
shin yong moon [('shin yong moon', 100.0, 3)]
shin young jeong [('shin young jeong', 100.0, 1)]
shinichi toyooka [('shinichi toyooka', 100.0, 7)]
shinichiro takezawa [('shinichiro takezawa', 100.0, 29)]
shinji osada [('shinji osada', 100.0, 3)]
shinji takahashi [('shinji takahashi', 100.0, 3)]
shinji teramoto [('shinji teramoto', 100.0, 1)]
shinzo kimura [('shinzo kimura', 100.0, 3)]
shirong wen [('shirong wen', 100.0, 3)]
shiying wang [('shiying wang', 100.0, 1)]
shizuo akira [('shizuo akira', 100.0, 0)]
shlomo dagan [('shlomo dagan', 100.0, 9)]
shoko makishi [('shoko makishi', 100.0, 3)]
shoukun wu [('shoukun wu', 100.0, 3)]
shousen wang [('shousen wang', 100.0, 2)]
shouwei han [('shouwei han', 100.0, 8)]
shripad n pal [('shripad n pal', 100.0, 0)]
shu xuan deng [('shuxuan deng', 96.0, 1)]
shucui jiang [('shucui jiang', 100.0, 7)]
shuguang wang [('shuguang wang', 100.0, 1)]
shui xi fu [('shui xi fu', 100.0, 23)]
shuk-ching ho [('shukching ho', 

tanzila saba [('tanzila saba', 100.0, 0)]
tao chen [('tao chen', 100.0, 1)]
tao huang [('tao huang', 100.0, 3)]
tao jiang [('tao jiang', 100.0, 3)]
tao jin [('tao jin', 100.0, 2)]
tao li [('tao li', 100.0, 1)]
tao peng [('tao peng', 100.0, 0)]
tao tao [('tao tao', 100.0, 2)]
tao wang [('tao wang', 100.0, 0)]
tao yang [('tao yang', 100.0, 4)]
tao yu [('tao yu', 100.0, 1)]
tao zhang [('tao zhang', 100.0, 9)]
tapan k audhya [('tapan audhya', 95.0, 6)]
tarek ben othman [('tarek ben othman', 100.0, 9)]
tarek shokeir [('tarek shokeir', 100.0, 0)]
tarun narang [('tarun narang', 100.0, 2)]
tatiana syrovets [('tatiana syrovets', 100.0, 2)]
tatjana degenhardt [('tatjana degenhardt', 100.0, 9)]
teizo yoshimura [('teizo yoshimura', 100.0, 6)]
tej p singh [('tej singh', 95.0, 1)]
tengfei bao [('tengfei bao', 100.0, 2)]
teresa kasprzycka-guttman [('teresa kasprzyckaguttman', 97.95918367346938, 8)]
teresa maria cierco [('teresa cierco', 95.0, 0)]
teresa valentino [('teresa valentino', 100.0, 8)]
tere

weihua zhang [('weihua zhang', 100.0, 7)]
weijun xiong [('weijun xiong', 100.0, 3)]
weiquan li [('weiquan li', 100.0, 1)]
weishui y weiser [('weishui y weiser', 100.0, 7)]
weiyi fang [('weiyi fang', 100.0, 3)]
weiyi huang [('weiyi huang', 100.0, 0)]
weiyun shi [('weiyun shi', 100.0, 2)]
wen li [('wen li', 100.0, 4)]
wen rui su [('wen rui su', 100.0, 10)]
wendie a robbins [('wendie a robbins', 100.0, 1)]
wenfeng li [('wenfeng li', 100.0, 2)]
wenjun wang [('wenjun wang', 100.0, 1)]
wentao wang [('wentao wang', 100.0, 1)]
wenyan fu [('wenyan fu', 100.0, 3)]
wenyan liao [('wenyan liao', 100.0, 4)]
wenzhu li [('wenzhu', 90.0, 4)]
wilhelm k aicher [('wilhelm k aicher', 100.0, 5)]
willem koomen [('willem koomen', 100.0, 2)]
william a simmons [('william a simmons', 100.0, 1)]
william b campbell [('william b campbell', 100.0, 0)]
william clegg [('william clegg', 100.0, 0)]
william h frey ii [('william h frey', 95.0, 8)]
william horne [('william c horne', 95.0, 1)]
william j cook [('william j co

ye tian [('ye tian', 100.0, 6)]
ye xi [('ye xi', 100.0, 8)]
ye-shih ho [('yeshih ho', 94.73684210526316, 4)]
yeon-kyun shin [('yeonkyun shin', 96.2962962962963, 0)]
yeoung gyu ko [('yeounggyu ko', 96.0, 4)]
yi chen [('yi chen', 100.0, 4)]
yi guo [('yi guo', 100.0, 7)]
yi liu [('yi liu', 100.0, 10)]
yi song [('yi song', 100.0, 1)]
yi wan [('yi wan', 100.0, 0)]
yi zhao [('yi zhao', 100.0, 8)]
yi-he ling [('yi he ling', 90.0, 9)]
yi-rui liang [('yirui liang', 95.65217391304348, 2)]
yifang chen [('yifang chen', 100.0, 0)]
yigong fu [('yigong fu', 100.0, 9)]
yijie zhang [('yijie zhang', 100.0, 0)]
yikui li [('yikui li', 100.0, 7)]
yin li [('yin li', 100.0, 2)]
ying chen [('ying chen', 100.0, 11)]
ying du [('ying du', 100.0, 1)]
ying hu [('ying hu', 100.0, 2)]
ying huang [('ying huang', 100.0, 3)]
ying jiang [('ying jiang', 100.0, 7)]
ying li [('ying li', 100.0, 1)]
ying ren [('ren ying', 95.0, 4)]
ying tang [('ying tang', 100.0, 3)]
ying wang [('ying wang', 100.0, 23)]
ying wu [('ying wu', 

In [93]:
# Checking how many were mapped out of how many

print(f"Out of {len(offender_to_records.keys())} repeated offenders identified in RW between x to 2015, "\
          f"{len(rw_to_mag_mapping)} were mapped in MAG based on same Record ID and fuzzy matching score > 90")


Out of 3917 repeated offenders identified in RW between x to 2015, 3397 were mapped in MAG based on same Record ID and fuzzy matching score > 90
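
Not every mapped pair is equally trustworthy. The printed output above contains 90.0 matches such as `lin chen` to `li`, `jinming zhang` to `jinming`, and `wenzhu li` to `wenzhu`, where the MAG name is only a fragment of the RW name. This is consistent with the default `WRatio` scorer of `process.extract` giving credit to partial matches. A minimal sketch for setting such pairs aside for manual review follows; the triples are copied from the output above, while `needs_review` and its `min_len_ratio` threshold of 0.8 are illustrative assumptions, not part of the original analysis.

```python
# (RW name, matched MAG name, score) triples copied from the printed output
matches = [
    ("a abdel motelib", "a abdel motelib", 100.0),
    ("a n al-isa", "a n alisa", 94.73684210526316),
    ("hao zhang", "zhang hao", 95.0),
    ("lin chen", "li", 90.0),
    ("jinming zhang", "jinming", 90.0),
]


def needs_review(rw_name: str, mag_name: str, score: float,
                 min_len_ratio: float = 0.8) -> bool:
    """Flag non-exact matches whose two names differ a lot in length."""
    if score == 100.0:
        return False
    shorter, longer = sorted((len(rw_name), len(mag_name)))
    return shorter / longer < min_len_ratio


flagged = [(rw, mag) for rw, mag, score in matches if needs_review(rw, mag, score)]
print(flagged)
```

A length-ratio check is deliberately crude: it also catches pairs where a first name was reduced to an initial (for example `davood omrani` to `d omrani`), which may well be correct but seem worth a second look.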


In [97]:
# Build a DataFrame from the RW-to-MAG name mapping

df_repeated_offender_matching = pd.DataFrame(list(rw_to_mag_mapping.items()), 
                                             columns=['AuthorNameNorm','MAGAuthorName'])

# For now, when several offenders map to the same MAG name, keep only the first one
# (keep='first' is deterministic: it keeps the first row in the current order)

df_repeated_offender_matching = df_repeated_offender_matching.drop_duplicates(subset='MAGAuthorName', keep='first')

df_repeated_offender_matching

Unnamed: 0,AuthorNameNorm,MAGAuthorName
0,a abdel motelib,a abdel motelib
1,a anthony,a anthony
2,a antony joseph,a antony joseph
3,a c jesudoss prabhakaran,ac jesudoss prabhakaran
4,a gattoni,a gattoni
...,...,...
3392,zonghua wang,zonghua wang
3393,zongxi han,zongxi han
3394,zu-hua gao,zuhua gao
3395,zuoren wang,zuoren wang


In [98]:
df_repeated_offender_matching[df_repeated_offender_matching['AuthorNameNorm'].duplicated(keep=False)]

Unnamed: 0,AuthorNameNorm,MAGAuthorName


In [99]:
df_repeated_offender_matching[df_repeated_offender_matching['MAGAuthorName'].duplicated(keep=False)]\
        .sort_values(by='MAGAuthorName')

Unnamed: 0,AuthorNameNorm,MAGAuthorName
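
Both duplicate checks above run after `drop_duplicates(subset='MAGAuthorName')`, so the check on `MAGAuthorName` is empty by construction and does not show which offenders were dropped. A minimal sketch for inspecting the collisions before dropping anything is given below. It inverts a mapping shaped like `rw_to_mag_mapping`; the names in the toy dictionary are hypothetical.

```python
from collections import defaultdict

# Toy mapping in the same shape as rw_to_mag_mapping (names are made up)
rw_to_mag_mapping = {
    "j smith": "j smith",
    "john smith": "j smith",
    "a kumar": "a kumar",
}

# Invert the mapping: MAG name -> all RW offenders that were matched to it
mag_to_offenders = defaultdict(list)
for offender, mag_name in rw_to_mag_mapping.items():
    mag_to_offenders[mag_name].append(offender)

# A collision is a MAG name claimed by more than one RW offender
collisions = {
    mag_name: offenders
    for mag_name, offenders in mag_to_offenders.items()
    if len(offenders) > 1
}
print(collisions)
```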


In [69]:
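# Left-merge the RW-to-MAG mapping onto the MAG author histories;
# rows without a matched offender get NaN in AuthorNameNorm
df_authors_mag2 = df_authors_mag.merge(df_repeated_offender_matching, on='MAGAuthorName',
                                      how='left')
# Display the result without exact duplicate rows (the result is not assigned back)
df_authors_mag2.drop_duplicates()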

Unnamed: 0,MAGPID,MAGAID,MAGAffID,MAGAuthorOrder,MAGTitle,MAGPubYear,MAGAuthorName,Record ID,RetractedPaperMAGPID,RetractionYear,AuthorNameNorm
0,524,569306176,148283060.0,1,the cryo thermochromatographic separator cts a...,2002.0,u w kirbach,23311,2273790572,2002.0,
1,1598005161,569306176,148283060.0,8,non observation of the production of superheav...,2002.0,u w kirbach,23311,2273790572,2002.0,
2,1968637420,569306176,,5,publisher s note confirmation of production of...,2003.0,u w kirbach,23311,2273790572,2002.0,
3,1989365649,569306176,148283060.0,14,chemical investigation of hassium element 108,2002.0,u w kirbach,23311,2273790572,2002.0,
4,2006519711,569306176,148283060.0,16,chemical and nuclear studies of hassium and el...,2004.0,u w kirbach,23311,2273790572,2002.0,
...,...,...,...,...,...,...,...,...,...,...,...
2858059,3168732224,3170455413,,4,development and application of intensified env...,2014.0,하정민,6759,3168732224,2015.0,
2858060,3168732224,3171097566,,3,development and application of intensified env...,2014.0,,6759,3168732224,2015.0,
2858061,3168732224,3171812317,,1,development and application of intensified env...,2014.0,youngsu an,6759,3168732224,2015.0,
2858062,3168732224,3172678465,,7,development and application of intensified env...,2014.0,noh jungpil,6759,3168732224,2015.0,


In [70]:
df_authors_mag.shape

(2841741, 10)
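
The merge above joins on `MAGAuthorName` alone, so every MAG row carrying a matched name string receives an `AuthorNameNorm`, including rows from records on which no offender was matched. For common names (the mapping contains entries such as `feng li` and `tao wang`) this may flag unrelated authors. One possible refinement, sketched here on toy data, is to join on both `Record ID` and `MAGAuthorName` and to let pandas verify that the join cannot add rows. The `Record ID` column in the matching table and the `IsRepeatedOffender` column are assumptions for illustration; the notebook's mapping does not currently store record IDs.

```python
import pandas as pd

# Toy MAG author rows: two different authors share the name "feng li"
df_authors = pd.DataFrame({
    "MAGAID": [1, 1, 2, 3],
    "MAGAuthorName": ["feng li", "feng li", "feng li", "k lee"],
    "Record ID": [10, 10, 20, 30],
})

# Hypothetical matching table: the offender was matched on record 10 only
df_match = pd.DataFrame({
    "AuthorNameNorm": ["feng li"],
    "MAGAuthorName": ["feng li"],
    "Record ID": [10],
})

# validate="many_to_one" raises if the right-hand keys are not unique,
# which guards against the merge silently duplicating rows
flagged = df_authors.merge(
    df_match,
    on=["Record ID", "MAGAuthorName"],
    how="left",
    validate="many_to_one",
)
flagged["IsRepeatedOffender"] = flagged["AuthorNameNorm"].notna()
print(flagged)
```

In this toy example only the two rows on record 10 are flagged, while the `feng li` on record 20 is left alone, and the row count of the result equals that of `df_authors`.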