# Repeated Offenders in MAG

In this notebook, we use the repeated offenders extracted from Retraction Watch (RW) to identify the same authors in MAG.

We do so as follows:

1. Read the repeated offenders we identified in RW
2. Read the authors we identified from MAG
3. Identify the records for each repeated offender
4. For each offender, look for a matching author name among the MAG authors of those records, using exact + fuzzy matching
5. If a MAG author matches with a fuzzy score >= 90, we flag that author as a repeated offender
6. Save a fully disambiguated authors file with the following columns:
    a. MAGAID
    b. MAGPID
    c. Record ID
    d. FirstRetractionYear
    e. OffenseSameYear
    f. nRetracted
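
The matching in steps 3 to 5 can be sketched on toy data. This is a minimal illustration, not the notebook's implementation: it uses `difflib.SequenceMatcher` from the standard library as a stand-in for the rapidfuzz scorer, so its scores will differ from the ones reported later. The names, record IDs, and the helpers `similarity` and `best_match` are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical toy inputs (not taken from RW or MAG).
offender_to_records = {"jean-pierre dupont": [101, 102]}
mag_authors_by_record = {
    101: ["jeanpierre dupont", "maria rossi"],
    102: ["jeanpierre dupont", "li wei"],
    103: ["someone else"],
}


def similarity(a, b):
    """Similarity on a 0-100 scale (stand-in for a rapidfuzz scorer)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def best_match(name, record_ids, cutoff=90):
    """Return the best-scoring MAG name on the given records, or None."""
    # Step 3: only authors of the offender's own records are candidates.
    candidates = {n for r in record_ids for n in mag_authors_by_record.get(r, [])}
    # Step 4: score every candidate and drop those below the cutoff.
    scored = [(similarity(name, c), c) for c in candidates]
    scored = [pair for pair in scored if pair[0] >= cutoff]
    return max(scored)[1] if scored else None


# Step 5: keep the offenders for which a match was found.
rw_to_mag = {}
for name, records in offender_to_records.items():
    match = best_match(name, records)
    if match is not None:
        rw_to_mag[name] = match

print(rw_to_mag)
```

Restricting the candidates to the offender's own records keeps the fuzzy search small and avoids matching against unrelated authors who happen to share a name.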

In [51]:
# Importing relevant packages

import pandas as pd
import os
from config_reader import read_config
from rapidfuzz import process, fuzz

In [52]:
# Reading paths
paths = read_config()
PROCESSED_REPEATED_OFFENDERS = paths['PROCESSED_REPEATED_OFFENDERS']
PROCESSED_RETRACTED_AUTHOR_HISTORIES_LOCAL_PATH = paths['PROCESSED_RETRACTED_AUTHOR_HISTORIES_LOCAL_PATH']
OUTDIR = paths['PROCESSED_FOLDER_LOCAL']

In [53]:
# Step 1
# This file contains repeated offenders identified from RW
df_repeated_offenders = pd.read_csv(PROCESSED_REPEATED_OFFENDERS)
# Normalizing the author name (though this is redundant as I normalized it in earlier files)
df_repeated_offenders['AuthorNameNorm'] = df_repeated_offenders['AuthorName'].str.lower()
df_repeated_offenders.head()

Unnamed: 0,Record ID,AuthorName,nRetracted,FirstRetractionYear,OffenseSameYear,AuthorNameNorm
0,25214,a a nesterenko,3,2020,True,a a nesterenko
1,25101,a a nesterenko,3,2020,True,a a nesterenko
2,25083,a a nesterenko,3,2020,True,a a nesterenko
3,17742,a abdel motelib,2,2015,True,a abdel motelib
4,17766,a abdel motelib,2,2015,True,a abdel motelib


In [54]:
# These also include authors from after 2015


print(f"Number of offenders identified using names: "
      f"{df_repeated_offenders['AuthorName'].nunique()},"
      f"\nDistribution by whether retracted in same year or not\n"
      f"{df_repeated_offenders.drop_duplicates(subset='AuthorName')['OffenseSameYear'].value_counts()}")

Number of offenders identified using names: 6591,
Distribution by whether retracted in same year or not
OffenseSameYear
True     3616
False    2975
Name: count, dtype: int64


In [55]:
# Map each repeated offender to all of their Record IDs:
# group by AuthorNameNorm and aggregate the Record IDs into a list
offender_to_records = df_repeated_offenders.groupby('AuthorNameNorm')['Record ID'].apply(list).to_dict()

offender_to_records

{'a a nesterenko': [25214, 25101, 25083],
 'a a zafar': [16742, 8462],
 'a abdel motelib': [17742, 17766],
 'a abdul ajees': [123, 18588],
 'a abou-elela': [4541, 4630],
 'a anthony': [4036, 1050],
 'a antony joseph': [17262, 2877],
 'a azhagappan': [20611, 21523],
 'a b (ð\x90 ð‘) martynushkin (ðœð°ñ€ñ‚ñ‹ð½ñƒñˆðºð¸ð½)': [25122, 25121],
 'a baradaran-rafii': [4790, 4789],
 'a birkan selcuk': [18288, 18287, 18289],
 'a c jesudoss prabhakaran': [4136, 4294],
 'a d nkengafac': [7869, 7871],
 'a dimitrios colevas': [20571, 20570],
 'a g koshchaev': [25215, 25216, 25214, 25083],
 'a gattoni': [3383, 3382],
 'a giolis': [2865, 2510],
 'a harkavyi': [1543, 2901],
 'a i volokitin': [21992, 21994, 21993, 22753],
 'a iordache': [4526, 4527],
 'a jafari': [16762, 18047],
 'a jake demetris': [3462, 6605],
 'a k gupta': [19185, 7764, 820],
 'a kumar': [3537, 3742],
 'a m (a m) mikhaleva (ðœð¸ñ…ð°ð»ðµð²ð°)': [25558, 25557],
 'a m k el-ghonemy': [5267, 18225, 18192, 18208, 5152],
 'a n al-isa': [1792
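
What the `groupby(...).apply(list).to_dict()` call produces can be reproduced in plain Python. The sketch below is only an illustration of the resulting structure; it uses the five rows shown in the `head()` output above and a `defaultdict` instead of pandas.

```python
from collections import defaultdict

# Rows copied from the head() output above: (Record ID, AuthorNameNorm).
rows = [
    (25214, "a a nesterenko"),
    (25101, "a a nesterenko"),
    (25083, "a a nesterenko"),
    (17742, "a abdel motelib"),
    (17766, "a abdel motelib"),
]

# Collect the Record IDs of each normalized name, preserving row order.
offender_to_records = defaultdict(list)
for record_id, name in rows:
    offender_to_records[name].append(record_id)

print(dict(offender_to_records))
```

A Record ID can appear under several names, since co-authors of the same retracted paper may both be repeated offenders (25214 is listed under both `a a nesterenko` and `a g koshchaev` above).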

In [56]:
# Step 2: Get all the authors that we identified from merging papers with MAG
df_authors_mag = pd.read_csv(PROCESSED_RETRACTED_AUTHOR_HISTORIES_LOCAL_PATH,
                            usecols=['MAGAID','RetractedPaperMAGPID',
                                    'MAGAuthorName','Record ID', 'RetractionYear'])\
                            .drop_duplicates()

df_authors_mag

Unnamed: 0,MAGAID,MAGAuthorName,Record ID,RetractedPaperMAGPID,RetractionYear
0,569306176,u w kirbach,23311,2273790572,2002.0
15,2230134732,k e gregorich,23311,2273790572,2002.0
282,2423554807,h nitsche,23311,2273790572,2002.0
575,2462990842,p a wilk,23311,2273790572,2002.0
666,2477911198,d c hoffman,23311,2273790572,2002.0
...,...,...,...,...,...
2841736,3170455413,하정민,6759,3168732224,2015.0
2841737,3171097566,,6759,3168732224,2015.0
2841738,3171812317,youngsu an,6759,3168732224,2015.0
2841739,3172678465,noh jungpil,6759,3168732224,2015.0


In [57]:
df_authors_mag['MAGAID'].nunique() # Fewer unique MAGAIDs than rows, because repeated offenders appear on several records

23620

In [58]:
# Let us create a list of repeated offenders based on MAGAID
# We shall then see which MAGAIDs are not dealt with from RW name matching, and deal with those later

df_offenders_per_MAGAID = df_authors_mag.groupby('MAGAID')['Record ID'].nunique().reset_index()\
                                .rename(columns={'Record ID':'NumRecords'})
repeated_offenders_per_MAGAID = df_offenders_per_MAGAID[df_offenders_per_MAGAID['NumRecords'].gt(1)]['MAGAID'].unique()


len(repeated_offenders_per_MAGAID) # All these should be dealt with


2284
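
The cell above counts the distinct Record IDs per MAGAID and keeps the IDs with more than one. The same logic in plain Python, as a small sketch on hypothetical `(MAGAID, Record ID)` pairs:

```python
from collections import defaultdict

# Hypothetical (MAGAID, Record ID) pairs; duplicates are allowed.
pairs = [(1, 10), (1, 10), (1, 11), (2, 10), (3, 12), (3, 13)]

# Distinct records per author (the equivalent of groupby + nunique).
records_per_author = defaultdict(set)
for magaid, record_id in pairs:
    records_per_author[magaid].add(record_id)

# Authors linked to more than one retracted record.
repeated = sorted(a for a, recs in records_per_author.items() if len(recs) > 1)
print(repeated)  # [1, 3]
```

Using distinct records rather than row counts matters here: author 1 has three rows but only two records, and a duplicated row alone would not make an author a repeated offender.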

In [59]:
# Now let us filter df_authors_mag to only the RW records that involve a repeated offender
# In other words, for now we only keep the MAG authors of those records

df_relevant_authors_mag = df_authors_mag[df_authors_mag['Record ID'].isin(df_repeated_offenders['Record ID'])]

df_relevant_authors_mag.head()

Unnamed: 0,MAGAID,MAGAuthorName,Record ID,RetractedPaperMAGPID,RetractionYear
1046,2668341939,sang young seol,1493,2111438153,2015.0
1970,666554414,antonio lopezbeltran,2991,1894242312,2014.0
2740,1293640941,rodolfo montironi,2991,1894242312,2014.0
3748,2161285743,liang cheng,2991,1894242312,2014.0
3749,2161285743,liang cheng,22881,2011983027,2013.0


In [60]:
# Now for each author in MAG, we shall see if that author is in RW repeated offenders list
# If it is, then we record that author to be flagged.
# Since we do not have author ids, we will use fuzzy matching on names
# Later we will do the flagging in the df_authors_mag

rw_to_mag_mapping = {}

for repeated_offender in offender_to_records.keys():
    
    # Extracting the records for this offender
    relevant_records = offender_to_records.get(repeated_offender)
    
    # Let us only extract records for this offender in MAG
    relevant_authors_mag = df_relevant_authors_mag[df_relevant_authors_mag['Record ID'].isin(relevant_records)]\
                        ['MAGAuthorName'].unique()
    
    # Among the MAG authors of these records, keep the single best match (limit=1) with score >= 90
    choices_in_mag = process.extract(repeated_offender, relevant_authors_mag,
                                     limit=1, score_cutoff=90)

    if choices_in_mag:
        print(repeated_offender, choices_in_mag)
        rw_to_mag_mapping[repeated_offender] = choices_in_mag[0][0]


a abdel motelib [('a abdel motelib', 100.0, 1)]
a anthony [('a anthony', 100.0, 1)]
a antony joseph [('a antony joseph', 100.0, 1)]
a c jesudoss prabhakaran [('ac jesudoss prabhakaran', 97.87234042553192, 0)]
a gattoni [('a gattoni', 100.0, 1)]
a giolis [('a giolis', 100.0, 9)]
a harkavyi [('a harkavyi', 100.0, 6)]
a iordache [('a iordache', 100.0, 6)]
a jake demetris [('a jake demetris', 100.0, 7)]
a k gupta [('a k gupta', 100.0, 8)]
a m k el-ghonemy [('a m k elghonemy', 96.7741935483871, 0)]
a n al-isa [('a n alisa', 94.73684210526316, 0)]
a parlato [('a parlato', 100.0, 3)]
a r m yusoff [('a r m yusoff', 100.0, 2)]
a rajendran [('a rajendran', 100.0, 3)]
a simon [('a simon', 100.0, 4)]
a venkata rao [('a venkata rao', 100.0, 5)]
aadithya b urs [('aadithya b urs', 100.0, 2)]
aaron s dumont [('aaron s dumont', 100.0, 1)]
abdeladhim ben abdeladhim [('abdeladhim ben abdeladhim', 100.0, 4)]
abdelilah chaoui [('abdelilah chaoui', 100.0, 3)]
abderrahman abdelkefi [('abderrahman abdelkefi',

brian g spratt [('brian g spratt', 100.0, 2)]
brian i rini [('brian i rini', 100.0, 2)]
brian r leaker [('brian leaker', 95.0, 10)]
brian scott peskin [('brian scott peskin', 100.0, 1)]
bridgette martin hard [('bridgette martin hard', 100.0, 1)]
britt hedman [('britt hedman', 100.0, 10)]
bronwyn a kingwell [('bronwyn a kingwell', 100.0, 0)]
bruce caterson [('bruce caterson', 100.0, 4)]
bruce e kemp [('bruce e kemp', 100.0, 4)]
bruce e murdoch [('bruce e murdoch', 100.0, 0)]
bruce j avolio [('bruce j avolio', 100.0, 0)]
bruce m damon [('bruce m damon', 100.0, 0)]
bruna pucci [('bruna pucci', 100.0, 2)]
bruno amato [('bruno amato', 100.0, 1)]
byeong chun lee [('byeong chun lee', 100.0, 4)]
byoung chul cho [('byoung chul cho', 100.0, 4)]
byron wingerd [('byron a wingerd', 95.0, 2)]
c chibelean [('c chibelean', 100.0, 2)]
c codoiu [('c codoiu', 100.0, 10)]
c karthikeyan [('c karthikeyan', 100.0, 1)]
c mirvald [('c mirvald', 100.0, 9)]
c noel bairey merz [('c noel bairey merz', 100.0, 3)]
c

da-qiang li [('daqiang li', 95.23809523809523, 4)]
dae won kim [('dae won kim', 100.0, 12)]
daiming fan [('daiming fan', 100.0, 0)]
dale kiesewetter [('dale o kiesewetter', 95.0, 7)]
dalibor petkovic [('dalibor petkovic', 100.0, 1)]
dalibor sames [('dalibor sames', 100.0, 0)]
damo xu [('damo xu', 100.0, 3)]
dan liu [('dan liu', 100.0, 9)]
dan luo [('dan luo', 100.0, 3)]
dan wang [('dan wang', 100.0, 6)]
dan yu [('dan yu', 100.0, 5)]
dana elias [('dana elias', 100.0, 8)]
dana peled [('dana peled', 100.0, 16)]
daniel d karp [('daniel d karp', 100.0, 14)]
daniel f klessig [('daniel f klessig', 100.0, 0)]
daniel kavan [('daniel kavan', 100.0, 26)]
daniel m prevedello [('daniel m prevedello', 100.0, 3)]
daniel martinez [('daniel martinez', 100.0, 9)]
daniel nicoletti [('daniel nicoletti', 100.0, 4)]
daniel r mcneill [('daniel r mcneill', 100.0, 8)]
daniel rozbesky [('daniel rozbeský', 93.33333333333333, 9)]
daniel st johnston [('daniel st johnston', 100.0, 2)]
daniel w chan [('daniel w chan

giuseppe remuzzi [('giuseppe remuzzi', 100.0, 0)]
gizem donmez [('gizem donmez', 100.0, 2)]
gong feili [('gong feili', 100.0, 2)]
gongyou chen [('gongyou chen', 100.0, 0)]
gowthaman swaminathan [('gowthaman swaminathan', 100.0, 1)]
grant c nicholson [('grant c nicholson', 100.0, 12)]
grant w cannon [('grant w cannon', 100.0, 7)]
gregory b young [('gregory b young', 100.0, 3)]
gregory fridman [('gregory fridman', 100.0, 4)]
gregory i liou [('gregory i liou', 100.0, 3)]
gregory schott [('gregory schott', 100.0, 7)]
guang wen [('g wen', 90.0, 4)]
guang yang [('guang yang', 100.0, 7)]
guang zhi he [('guangzhi he', 95.65217391304348, 2)]
guang-chao wang [('guangchao wang', 96.55172413793103, 2)]
guanghui hu [('guanghui hu', 100.0, 1)]
guangnan luo [('guangnan luo', 100.0, 6)]
guangwen wu [('guangwen wu', 100.0, 6)]
guangxiao yang [('guangxiao yang', 100.0, 0)]
guangyuan he [('guangyuan he', 100.0, 1)]
guangzhong xiong [('guangzhong', 90.0, 0)]
gui-qiang liu [('guiqiang liu', 96.0, 1)]
guihu

jean francois bach [('jeanfrancois bach', 97.14285714285714, 3)]
jean-pierre jacquot [('jeanpierre jacquot', 97.2972972972973, 0)]
jean-yves daniel [('jeanyves daniel', 96.7741935483871, 5)]
jee in choi [('jee in choi', 100.0, 7)]
jeff schell [('jeff schell', 100.0, 1)]
jeffrey d ritzenthaler [('jeffrey d ritzenthaler', 100.0, 2)]
jeffrey m beaubien [('jeffrey m beaubien', 100.0, 7)]
jeffrey marks [('jeffrey r marks', 95.0, 4)]
jeffrey r balser [('jeffrey r balser', 100.0, 0)]
jeffrey s elmendorf [('jeffrey s elmendorf', 100.0, 3)]
jeffrey s kroin [('jeffrey s kroin', 100.0, 4)]
jennifer rieusset [('jennifer rieusset', 100.0, 10)]
jennifer rodriguez [('jennifer rodriguez', 100.0, 12)]
jennnifer s lerner [('jennifer s lerner', 97.14285714285714, 3)]
jens forster [('jens forster', 100.0, 0)]
jeong a kim [('jeong a kim', 100.0, 9)]
jeong-dong kim [('jeongdong kim', 96.2962962962963, 3)]
jeremy s parker [('jeremy s parker', 100.0, 6)]
jerome e groopman [('jerome e groopman', 100.0, 0)]
jer

kristin roovers [('kristin roovers', 100.0, 3)]
kristjen b lundberg [('kristjen b lundberg', 100.0, 2)]
krutisundar mandal [('krutisundar mandal', 100.0, 4)]
kuan huang [('kuan huang', 100.0, 6)]
kui zhu [('kui zhu', 100.0, 8)]
kun li [('kun li', 100.0, 5)]
kun liang guan [('kunliang guan', 96.2962962962963, 0)]
kunigal n shivakumar [('kunigal n shivakumar', 100.0, 0)]
kuniharu miyajima [('kuniharu miyajima', 100.0, 13)]
kunihiro matsumoto [('kunihiro matsumoto', 100.0, 5)]
kunsoo rhee [('kunsoo rhee', 100.0, 4)]
kunyu yang [('kunyu yang', 100.0, 1)]
kurt hojlund [('kurt hojlund', 100.0, 3)]
kurt kofler [('kurt kofler', 100.0, 11)]
kwok-yung yuen [('kwokyung yuen', 96.2962962962963, 3)]
kyriakos spiliopoulos [('kyriakos spiliopoulos', 100.0, 3)]
kyu lim [('kyu lim', 100.0, 4)]
kyu-han kim [('kyu han kim', 90.9090909090909, 1)]
kyung hee paek [('kyung hee paek', 100.0, 0)]
kyung sun kang [('kyungsun kang', 96.2962962962963, 0)]
kyung-jin lee [('kyungjin lee', 96.0, 1)]
l cheng [('l chen

michele schiariti [('michele schiariti', 100.0, 9)]
michelle l robbin [('michelle l robbin', 100.0, 6)]
michelle sieburg [('michelle sieburg', 100.0, 7)]
michiyo itakura [('michiyo itakura', 100.0, 1)]
mickey m martin [('mickey m martin', 100.0, 13)]
miguel maestro [('miguel a maestro', 95.0, 1)]
miki igarashi [('miki igarashi', 100.0, 3)]
mikio tomita [('mikio tomita', 100.0, 1)]
milena penkowa [('milena penkowa', 100.0, 1)]
miloslav pospisil [('and miloslav pospisil', 95.0, 21)]
min hu [('min hu', 100.0, 10)]
min ki jee [('min ki jee', 100.0, 10)]
min li [('min li', 100.0, 7)]
min liu [('min liu', 100.0, 1)]
min lu [('min lu', 100.0, 0)]
min wang [('min wang', 100.0, 4)]
min wu [('min wu', 100.0, 0)]
min yang [('min yang', 100.0, 4)]
min zhang [('min zhang', 100.0, 1)]
mina kim [('mina kim', 100.0, 10)]
ming chen [('ming chen', 100.0, 4)]
ming gao [('ming gao', 100.0, 2)]
ming li [('ming li', 100.0, 0)]
ming liu [('ming liu', 100.0, 2)]
ming shi [('ming shi', 100.0, 5)]
ming yang [('

nisha upadhyay [('nisha upadhyay', 100.0, 6)]
nishi mathur [('nishi mathur', 100.0, 3)]
nitin t aggarwal [('nitin t aggarwal', 100.0, 1)]
nobutaka eiraku [('nobutaka eiraku', 100.0, 8)]
nobuto yamamoto [('nobuto yamamoto', 100.0, 2)]
nobuyasu komazawa [('nobuyasu komazawa', 100.0, 15)]
nobuyuki takasu [('nobuyuki takasu', 100.0, 0)]
nobuyuki yamamoto [('nobuyuki yamamoto', 100.0, 2)]
noor j ridha [('noor j ridha', 100.0, 2)]
noor ramji [('noor ramji', 100.0, 3)]
norbert nagy [('norbert nagy', 100.0, 14)]
noriaki kobayashi [('noriaki kobayashi', 100.0, 2)]
noriyuki asai [('noriyuki asai', 100.0, 10)]
noriyuki takai [('noriyuki takai', 100.0, 2)]
nusrat a motlekar [('nusrat a motlekar', 100.0, 3)]
o ghasemi [('o ghasemi', 100.0, 0)]
o. giles best [('o giles best', 96.0, 11)]
ofer mandelboim [('ofer mandelboim', 100.0, 3)]
oh shin kwon [('ohshin kwon', 95.65217391304348, 0)]
olav a gressner [('olav a gressner', 100.0, 6)]
olga panagiotopoulou [('olga panagiotopoulou', 100.0, 3)]
olivier c

ryuzo kawamori [('ryuzo kawamori', 100.0, 4)]
rã¼diger f schwerdtle [('rudiger f schwerdtle', 92.6829268292683, 8)]
s ashokkumar [('s ashokkumar', 100.0, 4)]
s c sharma [('s c sharma', 100.0, 10)]
s chandrashekara [('s chandrashekara', 100.0, 1)]
s f dos reis [('s f dos reis', 100.0, 6)]
s h behiry [('s h behiry', 100.0, 0)]
s hari babu [('s hari babu', 100.0, 5)]
s k bera [('s k bera', 100.0, 4)]
s kalimuthu [('s kalimuthu', 100.0, 1)]
s margaritis [('s margaritis', 100.0, 7)]
s n cobb [('s n cobb', 100.0, 3)]
s nicholas mason [('s nicholas mason', 100.0, 2)]
s patra [('s patra', 100.0, 5)]
s rahman zadeh [('s rahman zadeh', 100.0, 4)]
s rajaram [('s rajaram', 100.0, 0)]
s ravi [('s ravi', 100.0, 4)]
s s kadam [('s s kadam', 100.0, 0)]
s sarkar [('s sarkar', 100.0, 1)]
s schuler [('s schuler', 100.0, 3)]
s srikanth [('s srikanth', 100.0, 1)]
s velmurugan [('s velmurugan', 100.0, 1)]
s wang [('s wang', 100.0, 1)]
sabeera bonala [('sabeera bonala', 100.0, 6)]
sachendra bohra [('sachendr

shunji sugawara [('shunji sugawara', 100.0, 4)]
shuxian jiang [('shuxian jiang', 100.0, 4)]
shyam biswal [('shyam biswal', 100.0, 2)]
shyamal k goswami [('shyamal k goswami', 100.0, 11)]
shyi-min lu [('shyimin lu', 95.23809523809523, 0)]
siavash riahi [('siavash riahi', 100.0, 2)]
siba prasada panigrahi [('siba prasada panigrahi', 100.0, 2)]
sidney tessler [('sidney tessler', 100.0, 0)]
sihua wang [('sihua wang', 100.0, 2)]
silvia baiguera [('silvia baiguera', 100.0, 5)]
silvia bulfone-paus [('silvia bulfonepaus', 97.2972972972973, 21)]
silvia novello [('silvia novello', 100.0, 3)]
silvia sauer [('silvia sauer', 100.0, 3)]
silvia soddu [('silvia soddu', 100.0, 6)]
simon folkard [('simon folkard', 100.0, 1)]
simon j gaskell [('simon j gaskell', 100.0, 6)]
simon s cross [('simon s cross', 100.0, 1)]
simone reuter [('simone reuter', 100.0, 8)]
simone sagen [('simone sagen', 100.0, 13)]
simonetta dell'orto [('simonetta dellorto', 97.2972972972973, 3)]
sinead m miggin [('sinead m miggin', 1

theodoros kofidis [('theodoros kofidis', 100.0, 13)]
theresa m guerin [('theresa guerin', 95.0, 2)]
thienhuong n hoang [('thienhuong n hoang', 100.0, 0)]
thierry appelboom [('thierry appelboom', 100.0, 2)]
thierry nouspikel [('thierry nouspikel', 100.0, 4)]
thierry ponchon [('thierry ponchon', 100.0, 2)]
thomas beaver [('thomas m beaver', 95.0, 2)]
thomas d schmittgen [('thomas d schmittgen', 100.0, 2)]
thomas f franke [('thomas f franke', 100.0, 3)]
thomas f koetzle [('thomas f koetzle', 100.0, 2)]
thomas g spiro [('thomas g spiro', 100.0, 1)]
thomas kraus [('thomas kraus', 100.0, 9)]
thomas linn [('thomas linn', 100.0, 2)]
thomas lundeberg [('thomas lundeberg', 100.0, 4)]
thomas m behr [('m m d thomas behr', 95.0, 3)]
thomas meyer [('thomas meyer', 100.0, 2)]
thomas p almdal [('thomas almdal', 95.0, 5)]
thomas pohl [('thomas pohl', 100.0, 11)]
thongchai taechowisan [('thongchai taechowisan', 100.0, 5)]
thu-suong van le [('thusuong van le', 96.7741935483871, 6)]
tian-hu he [('tianhu h

yannis georgalis [('yannis georgalis', 100.0, 3)]
yanping li [('yanping li', 100.0, 4)]
yanyan wang [('wang', 90.0, 1)]
yao chen [('yao chen', 100.0, 3)]
yao yang [('yao yang', 100.0, 23)]
yao zhang [('yao zhang', 100.0, 6)]
yaobin zhu [('yaobin zhu', 100.0, 3)]
yaron moshkovitz [('yaron moshkovitz', 100.0, 4)]
yashin sreenivasan [('yashin sreenivasan', 100.0, 3)]
yashpal s kanwar [('yashpal s kanwar', 100.0, 0)]
yasuaki yamada [('yasuaki yamada', 100.0, 6)]
yasuhiko ito [('yasuhiko ito', 100.0, 7)]
yasuhiko tabata [('yasuhiko tabata', 100.0, 0)]
yasuhiro minami [('yasuhiro minami', 100.0, 4)]
yasuhiro yamaguchi [('yasuhiro yamaguchi', 100.0, 6)]
yasukiyo mori [('yasukiyo mori', 100.0, 7)]
yasunobu shibasaki [('yasunobu shibasaki', 100.0, 10)]
yasuo matsumura [('yasuo matsumura', 100.0, 2)]
yasuyoshi mizutani [('yasuyoshi mizutani', 100.0, 2)]
yasuyoshi ouchi [('yasuyoshi ouchi', 100.0, 2)]
ye tian [('ye tian', 100.0, 6)]
ye xi [('ye xi', 100.0, 8)]
ye-shih ho [('yeshih ho', 94.7368421
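
Some of the matches printed above sit exactly at the cutoff and pair a full name with a fragment, for example `yanyan wang` with `wang` and `guangzhong xiong` with `guangzhong` (both 90.0). `process.extract` was called without a `scorer` argument, so the library default applies, and that score appears to reward partial matches. One way to surface such pairs for manual review is a second, stricter full-string comparison. The sketch below uses `difflib.SequenceMatcher` as a stand-in (an assumption, not what the notebook does); `full_string_ratio` is a hypothetical helper.

```python
from difflib import SequenceMatcher

# Pairs copied from the output above: (RW name, matched MAG name).
pairs = [
    ("yanyan wang", "wang"),
    ("guangzhong xiong", "guangzhong"),
    ("guang wen", "g wen"),
    ("kyu-han kim", "kyu han kim"),
]


def full_string_ratio(a, b):
    """Whole-string similarity on a 0-100 scale; penalizes length differences."""
    return 100 * SequenceMatcher(None, a, b).ratio()


# Pairs that pass the original cutoff but fail the stricter check.
needs_review = [(rw, mag) for rw, mag in pairs if full_string_ratio(rw, mag) < 90]
for rw, mag in needs_review:
    print(rw, "->", mag)
```

A flagged pair is not necessarily wrong (`g wen` may well be an abbreviation of `guang wen`); the check only narrows down which matches deserve a second look.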

In [61]:
rw_to_mag_mapping

{'a abdel motelib': 'a abdel motelib',
 'a anthony': 'a anthony',
 'a antony joseph': 'a antony joseph',
 'a c jesudoss prabhakaran': 'ac jesudoss prabhakaran',
 'a gattoni': 'a gattoni',
 'a giolis': 'a giolis',
 'a harkavyi': 'a harkavyi',
 'a iordache': 'a iordache',
 'a jake demetris': 'a jake demetris',
 'a k gupta': 'a k gupta',
 'a m k el-ghonemy': 'a m k elghonemy',
 'a n al-isa': 'a n alisa',
 'a parlato': 'a parlato',
 'a r m yusoff': 'a r m yusoff',
 'a rajendran': 'a rajendran',
 'a simon': 'a simon',
 'a venkata rao': 'a venkata rao',
 'aadithya b urs': 'aadithya b urs',
 'aaron s dumont': 'aaron s dumont',
 'abdeladhim ben abdeladhim': 'abdeladhim ben abdeladhim',
 'abdelilah chaoui': 'abdelilah chaoui',
 'abderrahman abdelkefi': 'abderrahman abdelkefi',
 'abdolmajid bayandori moghaddam': 'abdolmajid bayandori moghaddam',
 'abdullah agit': 'abdullah agit',
 'abdur rahman': 'md abdur rahman',
 'abhalaxmi singh': 'abhalaxmi singh',
 'abhijit datta': 'abhijit datta',
 'abrah

In [62]:
# Check how many RW offenders were mapped. Note that this is not a fair comparison, because the RW list
# also includes all authors from after 2015. Excluding those, 3488 is out of roughly 3800, which is a good match rate.

print(f"Out of {len(offender_to_records)} repeated offenders identified in RW between x to 2020, "
      f"{len(rw_to_mag_mapping)} were mapped in MAG based on same Record ID and fuzzy matching score >= 90")


Out of 6591 repeated offenders identified in RW between x to 2020, 3488 were mapped in MAG based on same Record ID and fuzzy matching score >= 90


In [63]:
# We have identified repeated offenders in MAG using fuzzy matching.
# These offenders are in the rw_to_mag_mapping dictionary.

In [64]:
# Now we create a dataframe out of the dictionary first

df_repeated_offender_matching = pd.DataFrame(list(rw_to_mag_mapping.items()), 
                                             columns=['AuthorNameNorm','MAGAuthorName'])

df_repeated_offender_matching.head()
# We can notice that our fuzzy matching worked.


Unnamed: 0,AuthorNameNorm,MAGAuthorName
0,a abdel motelib,a abdel motelib
1,a anthony,a anthony
2,a antony joseph,a antony joseph
3,a c jesudoss prabhakaran,ac jesudoss prabhakaran
4,a gattoni,a gattoni


In [65]:
# For now let us just remove duplicates randomly

#df_repeated_offender_matching = df_repeated_offender_matching.drop_duplicates(subset='MAGAuthorName', keep='first')

# Now it is possible that our fuzzy matching resulted in two RW authors mapped to same MAG author
# What should we do in these cases?

df_repeated_offender_matching[df_repeated_offender_matching['MAGAuthorName'].\
                                  duplicated(keep=False)].sort_values(by='MAGAuthorName')


Unnamed: 0,AuthorNameNorm,MAGAuthorName
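
`duplicated(keep=False)` marks every row whose `MAGAuthorName` occurs more than once, so the empty result above means that no MAG name was claimed by two different RW names. For contrast, this is what a collision would look like on a hypothetical mapping (plain Python, names made up):

```python
from collections import Counter

# Hypothetical mapping in which two RW spellings point to one MAG name.
rw_to_mag = {
    "j smith": "john smith",
    "john smith": "john smith",
    "a kumar": "a kumar",
}

# Count how often each MAG name is used as a target.
counts = Counter(rw_to_mag.values())

# Keep every RW name whose MAG target is shared (the keep=False behaviour).
collisions = {rw: mag for rw, mag in rw_to_mag.items() if counts[mag] > 1}
print(collisions)
```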


In [66]:
# The reverse check is trivially empty: the RW names come from dictionary keys, so they cannot repeat
df_repeated_offender_matching[df_repeated_offender_matching['AuthorNameNorm'].duplicated(keep=False)]

Unnamed: 0,AuthorNameNorm,MAGAuthorName


In [67]:
# Both checks above return empty frames, so we no longer have a disambiguation problem in RW.

In [68]:
# Merging the matched repeated offenders to the other columns 
df_authors_mag2 = df_authors_mag.merge(df_repeated_offender_matching, on='MAGAuthorName',
                                      how='left') # to get all authors



df_authors_mag2.drop_duplicates()

Unnamed: 0,MAGAID,MAGAuthorName,Record ID,RetractedPaperMAGPID,RetractionYear,AuthorNameNorm
0,569306176,u w kirbach,23311,2273790572,2002.0,
1,2230134732,k e gregorich,23311,2273790572,2002.0,
2,2423554807,h nitsche,23311,2273790572,2002.0,
3,2462990842,p a wilk,23311,2273790572,2002.0,
4,2477911198,d c hoffman,23311,2273790572,2002.0,
...,...,...,...,...,...,...
27968,3170455413,하정민,6759,3168732224,2015.0,
27969,3171097566,,6759,3168732224,2015.0,
27970,3171812317,youngsu an,6759,3168732224,2015.0,
27971,3172678465,noh jungpil,6759,3168732224,2015.0,


In [69]:
df_authors_mag2[~df_authors_mag2['AuthorNameNorm'].isna()].sort_values(by='MAGAID')

Unnamed: 0,MAGAID,MAGAuthorName,Record ID,RetractedPaperMAGPID,RetractionYear,AuthorNameNorm
9094,4937601,david m ojcius,3093,2097651531,2012.0,david m ojcius
4587,11038048,fumihiro sanada,1676,2067576098,2014.0,fumihiro sanada
3880,11607012,milena penkowa,3408,2169292567,2014.0,milena penkowa
3881,11607012,milena penkowa,3778,2616663063,2012.0,milena penkowa
3879,11607012,milena penkowa,3405,2112644312,2014.0,milena penkowa
...,...,...,...,...,...,...
14985,3175629790,eddy s leman,1694,2112372740,2012.0,eddy s leman
22852,3175900097,zhiguo wang,4604,2037997249,2011.0,zhiguo wang
26337,3176739328,mingwen fan,8206,2121329160,2012.0,mingwen fan
26338,3177103299,xuechao yang,8206,2121329160,2012.0,xuechao yang
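
Note that the merge above joins on `MAGAuthorName` only, so every MAG row carrying a matched name receives an `AuthorNameNorm`, including rows from records that are not among the offender's own. With common names in the mapping (for example `min li` or `ming li`), this may flag an unrelated author who shares the name. One possible refinement, shown here as a sketch on hypothetical data and not as the notebook's method, is to key the match on the pair (Record ID, name):

```python
# Hypothetical data: two different MAG authors share the name "min li".
mag_rows = [
    {"MAGAID": 1, "MAGAuthorName": "min li", "Record ID": 10},
    {"MAGAID": 1, "MAGAuthorName": "min li", "Record ID": 11},
    {"MAGAID": 2, "MAGAuthorName": "min li", "Record ID": 99},  # unrelated author
]

# The RW offender "min li" is linked to records 10 and 11 only.
matched_names = {"min li"}
matched_pairs = {(10, "min li"), (11, "min li")}

# Joining on the name alone also flags the unrelated author.
by_name = sorted({r["MAGAID"] for r in mag_rows
                  if r["MAGAuthorName"] in matched_names})

# Joining on (Record ID, name) flags only the offender's own rows.
by_record_and_name = sorted({r["MAGAID"] for r in mag_rows
                             if (r["Record ID"], r["MAGAuthorName"]) in matched_pairs})

print(by_name)             # [1, 2]
print(by_record_and_name)  # [1]
```

In pandas terms this would mean storing the Record ID next to each matched name and merging on both columns; whether that is needed depends on how often names collide in the actual data.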


In [70]:
"""
Task: Repeated Offenders

0) Cannot use MAG directly because RW-MAG merging has happened on a very small dataset
---------- Solution:
1) Identify repeated offenders using RW's Author Name (exact match, otherwise fuzzy score > 95).
Step (1) clusters authors under one name.
2) Use the name in RW to match the name in the RW-MAG matched record for the author.
Problem: Some authors still remain split across several names in RW (because of the 95 threshold),
but those same authors get matched to the same MAG Author ID.
How:
(RWnameA, RWnameB) <= 95
(RWnameA, MAGname) > 95
(RWnameB, MAGname) > 95
3) Went back to manually fix this.

4) Removed repeated offenders and put them in a separate set.
5) After removing repeated offenders identified by RW, ~350 Author IDs in MAG occur multiple times,
    i.e. have multiple retracted records linked in RW.
Why?
These were authors either missing in RW in terms of names, or not identified as repeated offenders.
6) Suggestion: Remove them completely, as we do not know their "first" retraction year because not all
RW-MAG records are merged.

Final Note: We decided to remove these authors

"""

df_authors_mag2.head(2)

Unnamed: 0,MAGAID,MAGAuthorName,Record ID,RetractedPaperMAGPID,RetractionYear,AuthorNameNorm
0,569306176,u w kirbach,23311,2273790572,2002.0,
1,2230134732,k e gregorich,23311,2273790572,2002.0,


In [148]:
# At this point, we have resolved the repeated offenders based on names
# Resolved as in, we have identified their MAGAID
# Now we need to resolve some based on MAGAID, and identify their first RetractionYear and OffenseSameYear
# In fact, we don't have to resolve them, but just remove them

# So let's do that

# Identifying authors not retracted more than once
# These are the rows where AuthorNameNorm is NaN
df_single_offenders = df_authors_mag2[df_authors_mag2['AuthorNameNorm'].isna()]

# Removing those that were repeated offenders based on MAGAID
df_single_offenders = df_single_offenders[~df_single_offenders['MAGAID'].isin(repeated_offenders_per_MAGAID)]

# Removing those without any name in MAG
# (.copy() so that the column assignment below does not act on a slice of df_authors_mag2)
df_single_offenders = df_single_offenders[~df_single_offenders['MAGAuthorName'].isna()].copy()

# Add columns for proper merging
df_single_offenders['nRetracted'] = 1

# Dropping redundant column
df_single_offenders = df_single_offenders.drop(columns=['AuthorNameNorm'])

df_single_offenders['MAGAID'].nunique(), df_single_offenders['Record ID'].nunique()

(19030, 5275)

In [149]:
df_single_offenders.head()

Unnamed: 0,MAGAID,MAGAuthorName,Record ID,RetractedPaperMAGPID,RetractionYear,nRetracted
0,569306176,u w kirbach,23311,2273790572,2002.0,1
1,2230134732,k e gregorich,23311,2273790572,2002.0,1
2,2423554807,h nitsche,23311,2273790572,2002.0,1
3,2462990842,p a wilk,23311,2273790572,2002.0,1
4,2477911198,d c hoffman,23311,2273790572,2002.0,1


In [167]:
# Let us create the author list for repeated offenders (I called it "multiple" because "repeated" is already used)
df_multiple_offenders = df_authors_mag2[~df_authors_mag2['AuthorNameNorm'].isna()]

# Adding the two missing columns in multiple offenders: nRetracted and OffenseSameYear
df_multiple_offenders = df_multiple_offenders.merge(df_repeated_offenders[['AuthorNameNorm','nRetracted','FirstRetractionYear',
                                                  'OffenseSameYear']].drop_duplicates(), 
                           on='AuthorNameNorm')

# Removing authors that were repeated in MAG but not in RW
#problematic_set_repeated_in_MAG = set(repeated_offenders_per_MAGAID) - set(df_multiple_offenders['MAGAID'].unique())

#df_multiple_offenders = df_multiple_offenders[~df_multiple_offenders['MAGAID']\
#                                                .isin(problematic_set_repeated_in_MAG)]

# Only extracting the retractions which happened in the same year as first retraction
df_multiple_offenders = df_multiple_offenders[df_multiple_offenders['RetractionYear'] == df_multiple_offenders['FirstRetractionYear']]

# Removing those with offenseSameYear as false
df_multiple_offenders = df_multiple_offenders[df_multiple_offenders['OffenseSameYear']]

# Dropping redundant column from repeated offenders as well
df_multiple_offenders = df_multiple_offenders.drop(columns=['AuthorNameNorm','OffenseSameYear','FirstRetractionYear'])

df_multiple_offenders.head()

Unnamed: 0,MAGAID,MAGAuthorName,Record ID,RetractedPaperMAGPID,RetractionYear,nRetracted
6,2105296260,mohd ali hashim,7966,2082253283,2014.0,2
9,2070170340,paolo pozzilli,3688,2033388709,2015.0,3
10,2070170340,paolo pozzilli,3687,2091554294,2015.0,3
14,1920490037,jerome e groopman,625,1969257293,2011.0,2
15,1920490037,jerome e groopman,363,2036017194,2011.0,2
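
The two filters in the cell above keep a row only if the retraction happened in the author's first retraction year and all of the author's retractions fall in that same year (`OffenseSameYear` is `True`). A plain-Python sketch of that rule on hypothetical rows:

```python
# Hypothetical rows; author 1 was retracted twice in 2011,
# author 2 was retracted in 2012 and again in 2014.
rows = [
    {"MAGAID": 1, "RetractionYear": 2011, "FirstRetractionYear": 2011, "OffenseSameYear": True},
    {"MAGAID": 1, "RetractionYear": 2011, "FirstRetractionYear": 2011, "OffenseSameYear": True},
    {"MAGAID": 2, "RetractionYear": 2014, "FirstRetractionYear": 2012, "OffenseSameYear": False},
    {"MAGAID": 2, "RetractionYear": 2012, "FirstRetractionYear": 2012, "OffenseSameYear": False},
]

# Keep first-year retractions of authors whose offenses all share one year.
kept = [
    r for r in rows
    if r["RetractionYear"] == r["FirstRetractionYear"] and r["OffenseSameYear"]
]

print([r["MAGAID"] for r in kept])  # [1, 1]
```

Author 2 is dropped entirely, including the 2012 row that does match the first retraction year, because the second condition applies to the author as a whole.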


## Sanity checks

In [168]:
df_single_offenders.shape[0] == df_single_offenders['MAGAID'].nunique()

True

In [169]:
df_single_offenders['Record ID'].nunique(), df_single_offenders['RetractedPaperMAGPID'].nunique()

(5275, 5275)

In [170]:
df_single_offenders[df_single_offenders['MAGAuthorName'].isna()]

Unnamed: 0,MAGAID,MAGAuthorName,Record ID,RetractedPaperMAGPID,RetractionYear,nRetracted


In [172]:
df_multiple_offenders

Unnamed: 0,MAGAID,MAGAuthorName,Record ID,RetractedPaperMAGPID,RetractionYear,nRetracted
6,2105296260,mohd ali hashim,7966,2082253283,2014.0,2
9,2070170340,paolo pozzilli,3688,2033388709,2015.0,3
10,2070170340,paolo pozzilli,3687,2091554294,2015.0,3
14,1920490037,jerome e groopman,625,1969257293,2011.0,2
15,1920490037,jerome e groopman,363,2036017194,2011.0,2
...,...,...,...,...,...,...
7970,3143309865,wenzhu,240,3142982949,2013.0,2
7971,3142266553,wenzhu,241,3146044440,2013.0,2
7972,3143592260,caixia,240,3142982949,2013.0,2
7973,3144031722,caixia,241,3146044440,2013.0,2


## Merging and Saving

In [173]:
# Now we need to create a single file for authors

df_merged = pd.concat([df_single_offenders,df_multiple_offenders])
df_merged

Unnamed: 0,MAGAID,MAGAuthorName,Record ID,RetractedPaperMAGPID,RetractionYear,nRetracted
0,569306176,u w kirbach,23311,2273790572,2002.0,1
1,2230134732,k e gregorich,23311,2273790572,2002.0,1
2,2423554807,h nitsche,23311,2273790572,2002.0,1
3,2462990842,p a wilk,23311,2273790572,2002.0,1
4,2477911198,d c hoffman,23311,2273790572,2002.0,1
...,...,...,...,...,...,...
7970,3143309865,wenzhu,240,3142982949,2013.0,2
7971,3142266553,wenzhu,241,3146044440,2013.0,2
7972,3143592260,caixia,240,3142982949,2013.0,2
7973,3144031722,caixia,241,3146044440,2013.0,2


In [174]:
# Constants
OUTPUT_DIRECTORY = OUTDIR
FILENAME = "RWMAG_authors_SingleAndRepeatedOffendersSameYear"

file_path = os.path.join(OUTPUT_DIRECTORY, f"{FILENAME}.csv")

# Writing the de-duplicated DataFrame to CSV; any error is reported instead of raised
try:
    df_merged.drop_duplicates().to_csv(file_path, index=False)
    print("File saved successfully")
except Exception as e:
    print(f"Error saving file: {e}")

File saved successfully


In [175]:
df_multiple_offenders['Record ID'].nunique()

1244

In [176]:
df_merged['MAGAID'].nunique()

20713

In [177]:
df_merged['Record ID'].nunique()

5746