# Extracting repeated offenders

In this notebook, we flag authors who have been retracted multiple times. More concretely,

1) We will first identify all bulk retractions in RW, and flag them.

2) We will then extract all the author names from RW along with Record ID, RetractionDate, RetractionYear

3) We will then split the author names such that we have Author first name, last name, Record ID as separate columns

4) We will then identify authors that were

    a) Retracted just once

    b) Retracted multiple times when bulk retractions are excluded

    c) Retracted multiple times when bulk retractions are included

5) For each author retracted multiple times, we will identify the difference in years between their first and second retraction. We will compute this difference (i) from the exact dates and (ii) from the years alone, to allow for different levels of precision.

In [1]:
# Importing relevant packages

import pandas as pd
import os
from config_reader import read_config
from rapidfuzz import process, fuzz

In [2]:
# Reading paths
paths = read_config()
RW_ORIGINAL_W_YEAR_LOCAL_PATH = paths['RW_ORIGINAL_W_YEAR_LOCAL_PATH']
OUTDIR = paths['PROCESSED_FOLDER_LOCAL']

In [3]:
# Reading retraction watch
df_rw = pd.read_csv(RW_ORIGINAL_W_YEAR_LOCAL_PATH,
                   usecols=['Record ID', 'Author', 'RetractionDate', 'RetractionYear', 'Title',
                           'Journal'])
df_rw.head(1)

Unnamed: 0,Record ID,Title,Journal,Author,RetractionDate,RetractionYear
0,28599,TWEAK-Fn14 Influences Neurogenesis Status via ...,Molecular Neurobiology,Jing Xu;Jian He;Huang He;Renjun Peng;Jian Xi,2021-05-15,2021.0


## 1. Identifying bulk retractions

In [4]:
# Identify bulk retractions: count distinct records per journal and retraction date
df_rw_bulkCounts = df_rw.groupby(['Journal','RetractionDate'])['Record ID'].nunique().reset_index()\
                            .rename(columns={'Record ID':'bulkCounts'})

# merging with actual RW 
df_rw2 = df_rw.merge(df_rw_bulkCounts, on=['Journal','RetractionDate'], how='left')

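# Flagging records that belong to a bulk retraction
# (5 or more records retracted by the same journal on the same date)
df_rw2['RetractedInBulk'] = df_rw2['bulkCounts'].ge(5)

# Keeping retractions up to and including 2020 (rows with a missing RetractionYear are also dropped)
df_rw2 = df_rw2[df_rw2['RetractionYear'].le(2020)]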

# Removing bulk retractions
df_rw2 = df_rw2[~df_rw2['RetractedInBulk']]

df_rw2.head(1)

Unnamed: 0,Record ID,Title,Journal,Author,RetractionDate,RetractionYear,bulkCounts,RetractedInBulk
63,27832,Enantioselective Organocatalytic Hantzsch Synt...,Organic Letters,Christopher G Evans;Jason E Gestwicki,2014-10-02,2014.0,1.0,False
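
The bulk-retraction flag can also be computed without the intermediate merge. The sketch below runs on toy data (the journal names and record IDs are made up) and assumes the same rule as the cell above: five or more distinct records retracted by one journal on one date count as a bulk retraction.

```python
import pandas as pd

# Toy data, invented for illustration: journal "A" retracts 5 records on one day.
toy = pd.DataFrame({
    "Record ID": [1, 2, 3, 4, 5, 6, 7],
    "Journal": ["A", "A", "A", "A", "A", "A", "B"],
    "RetractionDate": ["2019-01-01"] * 5 + ["2019-06-01", "2019-01-01"],
})

BULK_THRESHOLD = 5  # same cutoff as in the cell above

# transform('nunique') broadcasts each group's distinct-record count back onto its rows
toy["bulkCounts"] = (
    toy.groupby(["Journal", "RetractionDate"])["Record ID"].transform("nunique")
)
toy["RetractedInBulk"] = toy["bulkCounts"].ge(BULK_THRESHOLD)

print(toy[["Record ID", "bulkCounts", "RetractedInBulk"]])
```

Only the five records that journal "A" retracted on the same date are flagged; its sixth record, retracted on another date, is not.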


In [5]:
df_rw2['Record ID'].nunique()

14480

## 2. Extracting author names

In [6]:
pd.set_option('display.max_colwidth', None)

# Split the "Author" column by ";" and then explode it to separate rows
df_rw2['AuthorName'] = df_rw2['Author'].str.split(';')
df_exploded = df_rw2.explode('AuthorName')

# Removing empty authors
df_exploded['AuthorName'] = df_exploded['AuthorName'].str.strip().str.lower()
df_exploded = df_exploded[df_exploded['AuthorName'].ne('') & 
                         ~df_exploded['AuthorName'].isna()]

# sorting
df_exploded.sort_values(by='AuthorName').reset_index().head(1)

Unnamed: 0,index,Record ID,Title,Journal,Author,RetractionDate,RetractionYear,bulkCounts,RetractedInBulk,AuthorName
0,3091,22375,Targeted Protein Internalization and Degradation by ENDosome TArgeting Chimeras (ENDTACs),ACS Central Science,Dhanusha A Nalawansha;Stacey-Lynn Paiva;/duane N Rafizadeh;Mariell Pettersson;Liena Qin;Craig M Crews,2020-01-23,2020.0,1.0,False,/duane n rafizadeh
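
To make the split-and-explode step concrete, here is a minimal sketch on two invented records. It shows why the empty-name filter is needed: a trailing `;` in the `Author` field produces an empty string after splitting.

```python
import pandas as pd

# Toy records, invented for illustration; note the trailing ';' and stray spaces.
toy = pd.DataFrame({
    "Record ID": [101, 102],
    "Author": ["Jane Q Doe; John Roe;", "John Roe"],
})

# One row per (record, author): split on ';', then explode the resulting lists
exploded = toy.assign(AuthorName=toy["Author"].str.split(";")).explode("AuthorName")

# Normalize whitespace and case, then drop empty or missing names
exploded["AuthorName"] = exploded["AuthorName"].str.strip().str.lower()
exploded = exploded[exploded["AuthorName"].notna() & exploded["AuthorName"].ne("")]

print(exploded[["Record ID", "AuthorName"]].to_string(index=False))
```

After cleaning, `john roe` appears once for each of the two records, which is what the later `nunique` count per author relies on.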


In [7]:
# Before moving on, we manually normalize some names.
# These names were matched in MAG, but they cause a problem downstream.
"""
When identifying repeated offenders, we disambiguate author names within RW
using a fuzzy matching score of 96 or above as the criterion. Some names
score below this threshold even though they belong to the same author,
so these variants remain unmerged in RW.
Later, when we fuzzy match RW author names to MAG author names, two such
unmerged RW names can get matched to the same MAG author. In other words,
both RW names are close matches to the MAG author name, but not to each other.
For simplicity, we map these names to a single spelling here.
"""


rw_to_mag_authorname = {"angela d'angelo": 'angela dangelo',
 'angela dâ€™angelo': 'angela dangelo',
 'li bin': 'bin li',
 'bin li': 'bin li',
 'chenhui qiao': 'chenhui qiao',
 'qiao chenhui': 'chenhui qiao',
 'clement asiedu': 'clement asiedu',
 'clement k asiedu': 'clement asiedu',
 'dong hee shin': 'donghee shin',
 'dong-hee shin': 'donghee shin',
 'eric poehlman': 'eric t poehlman',
 'eric t poehlman': 'eric t poehlman',
 'hyung in moon': 'hyungin moon',
 'hyung-in moon': 'hyungin moon',
 'jiahong xia': 'jiahong xia',
 'xia jiahong': 'jiahong xia',
 'jing chen': 'jing chen',
 'chen jing': 'jing chen',
 'jose a martinez': 'jose a martinez',
 'jose martinez': 'jose a martinez',
 'zhang kailun': 'kailun zhang',
 'kailun zhang': 'kailun zhang',
 'kyung-hee paek': 'kyung hee paek',
 'kyung hee paek': 'kyung hee paek',
 'lian-da li': 'lianda li',
 'lianda li': 'lianda li',
 'limao wu': 'limao wu',
 'li-mao wu': 'limao wu',
 'lu zhang': 'lu zhang',
 'l zhang': 'lu zhang',
 'mi sun kim': 'misun kim',
 'mi-sun kim': 'misun kim',
 'ning li': 'ning li',
 'li ning': 'ning li',
 'roland h mertelsmann': 'roland mertelsmann',
 'roland mertelsmann': 'roland mertelsmann',
 'yu liu': 'yu liu',
 'y liu': 'yu liu'}


df_exploded['AuthorName']=df_exploded['AuthorName'].replace(rw_to_mag_authorname)
df_exploded.head()

Unnamed: 0,Record ID,Title,Journal,Author,RetractionDate,RetractionYear,bulkCounts,RetractedInBulk,AuthorName
63,27832,Enantioselective Organocatalytic Hantzsch Synthesis of Polyhydroquinolines,Organic Letters,Christopher G Evans;Jason E Gestwicki,2014-10-02,2014.0,1.0,False,christopher g evans
63,27832,Enantioselective Organocatalytic Hantzsch Synthesis of Polyhydroquinolines,Organic Letters,Christopher G Evans;Jason E Gestwicki,2014-10-02,2014.0,1.0,False,jason e gestwicki
581,25550,Mechanistic attributes of S100A7 (psoriasin) in resistance of anoikis resulting tumor progression in squamous cell carcinoma of the oral cavity,Cancer Cell International,Kaushik Kumar Dey;Siddik Sarkar;Ipsita Pal;Subhasis Das;Goutam Dey;Rashmi Bharti;Payel Banik;Jay Gopal Roy;Sukumar Maity;Indranil Kulavi;Mahitosh Mandal,2015-10-08,2015.0,1.0,False,kaushik kumar dey
581,25550,Mechanistic attributes of S100A7 (psoriasin) in resistance of anoikis resulting tumor progression in squamous cell carcinoma of the oral cavity,Cancer Cell International,Kaushik Kumar Dey;Siddik Sarkar;Ipsita Pal;Subhasis Das;Goutam Dey;Rashmi Bharti;Payel Banik;Jay Gopal Roy;Sukumar Maity;Indranil Kulavi;Mahitosh Mandal,2015-10-08,2015.0,1.0,False,siddik sarkar
581,25550,Mechanistic attributes of S100A7 (psoriasin) in resistance of anoikis resulting tumor progression in squamous cell carcinoma of the oral cavity,Cancer Cell International,Kaushik Kumar Dey;Siddik Sarkar;Ipsita Pal;Subhasis Das;Goutam Dey;Rashmi Bharti;Payel Banik;Jay Gopal Roy;Sukumar Maity;Indranil Kulavi;Mahitosh Mandal,2015-10-08,2015.0,1.0,False,ipsita pal


In [8]:
# Identifying authors that were retracted multiple times

df_numRetracted = df_exploded.groupby('AuthorName')['Record ID'].nunique().reset_index()\
                            .rename(columns={'Record ID':'nRetracted'})


df_numRetracted.head(1)

Unnamed: 0,AuthorName,nRetracted
0,/duane n rafizadeh,1


In [9]:
# Printing authors with multiple retractions
df_numRetracted[df_numRetracted['nRetracted'].gt(1)]

Unnamed: 0,AuthorName,nRetracted
29,a a nesterenko,3
37,a a zafar,2
38,a abdel motelib,2
39,a abdul ajees,2
42,a abou-elela,2
...,...,...
51247,zongxi han,2
51275,zu-hua gao,2
51288,zulkanain abdul rahman,2
51303,zuoren wang,2


# Cleaning author names

In [10]:
# The problem now is that the same author can appear under slightly different spellings.
# For example, angela d'angelo and angela dangelo are the same author,
# but the name is spelled differently across RW records. So we need to merge these entries.
# We will do so by fuzzy matching author names within RW,
# normalizing each group of variants to a single spelling, and then redoing the count above.
# Think of this as normalizing the names (while also disambiguating them).
# The example below shows that fuzzy matching can work well.

fuzz.ratio("angela d'angelo", "angela dâ€™angelo")

87.5
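
The score of 87.5 can be reproduced by hand. `fuzz.ratio` is documented as a normalized Indel similarity, which works out to `100 * 2 * LCS(a, b) / (len(a) + len(b))`, where LCS is the length of the longest common subsequence. The sketch below is a plain-Python re-implementation for illustration only; `lcs_length` and `indel_ratio` are our own helper names, not rapidfuzz functions. The second spelling contains three mis-encoded characters in place of the apostrophe, written here as escapes.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """Normalized Indel similarity on a 0-100 scale."""
    total = len(a) + len(b)
    return 100.0 if total == 0 else 100.0 * 2 * lcs_length(a, b) / total


clean = "angela d'angelo"                      # 15 characters
garbled = "angela d\u00e2\u20ac\u2122angelo"   # 17 characters, as stored in RW

# All 14 characters other than the apostrophe are shared: 100 * 2 * 14 / 32
print(indel_ratio(clean, garbled))
```

A plain ratio of 87.5 sits below the cutoff of 96 chosen in the following cells, which is consistent with this pair being listed in the manual mapping above.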

In [11]:
authornames = list(df_exploded['AuthorName'].unique())

#authornames.remove('Christopher G Evans')

len(authornames)

51329

In [12]:
# For each author, we run a fuzzy match against the remaining names and record the close variants
# (process.extract returns at most 5 matches per query by default).
# We considered score cutoffs above 90 or 95; after testing, 96 seems to work best as the threshold.

authornames = list(df_exploded['AuthorName'].unique())

# For each authorname, we will identify their top matches with score >= 96
name_to_correctName = {}

# We will do so until authornames are all gone through fuzzy matching
print(len(authornames))
while(len(authornames) > 0):
    
    current_author = authornames[0] # We fix 0 as we are removing authornames
    # Let us remove current author from authornames
    authornames.remove(current_author)
    # Let us find all the similar authornames
    choices = process.extract(current_author, authornames, score_cutoff=96)
    # Let us keep only the names (each match is a (name, score, index) tuple)
    choices = [name for (name, score, index) in choices]
    # Add to dictionary
    if len(choices) != 0:
        for choice in choices:
            name_to_correctName[choice] = current_author
        # Update authorname list
        authornames = [name for name in authornames if name not in choices]
        print(current_author, choices)
        print(f"Number of authors left: {len(authornames)}")


51329
yandong zhang ['yadong zhang']
Number of authors left: 51263
zhi ping zhang ['zhiping zhang']
Number of authors left: 51216
catriona mclean ['caitriona mclean']
Number of authors left: 51168
rafael arcesio delgado-ruiz ['rafael arcesio delgado ruiz']
Number of authors left: 51048
manuel fernandez dominguez ['manuel fernandez-dominguez']
Number of authors left: 51046
guanghui wei ['guang hui wei']
Number of authors left: 50935
g krishnamurthy naidu ['g krishna murthy naidu']
Number of authors left: 50692
andreas hinz ['andreas heinz']
Number of authors left: 50570
zhenlin zhang ['zhenling zhang']
Number of authors left: 50530
chaojun duan ['chao-jun duan']
Number of authors left: 50528
jianhua zang ['jianhua zhang']
Number of authors left: 50490
guoping jiang ['guoping jian']
Number of authors left: 50481
xiaoping zhu ['xiaoping zhou']
Number of authors left: 50330
chengyong qin ['cheng-yong qin']
Number of authors left: 50201
sagartirtha sarkar ['sagatirtha sarkar']
Number of aut

xiao hui deng ['xiaohui deng']
Number of authors left: 39114
fernando aros ['fernandos aros']
Number of authors left: 39064
giovanni tallini, ['giovanni tallini']
Number of authors left: 38955
munir pirmohamaed ['munir pirmohamed']
Number of authors left: 38809
yong ming wang ['yongming wang']
Number of authors left: 38797
sun kyoung han ['sun young han']
Number of authors left: 38506
ali reza yaghoubi ['alireza yaghoubi']
Number of authors left: 38467
al refaey kandeel ['alrefaey kandeel']
Number of authors left: 38454
polina goihberg ['polina goichberg']
Number of authors left: 38326
d g (ð” ð“) volkova (ð’ð¾ð»ðºð¾ð²ð°) ['d g (ð” ð“) volkov (ð’ð¾ð»ðºð¾ð²ð°)']
Number of authors left: 38246
constantino del gaudio ['costantino del gaudio']
Number of authors left: 37838
yun feng zhao ['yu feng zhao']
Number of authors left: 37764
konstantinos francis ['kostantinos francis']
Number of authors left: 37657
xiaojuan huang ['xiaojun huang']
Number of authors left: 37606
michael olausson ['mic

anne e willis ['anne e wills']
Number of authors left: 12392
spiros parmithiotis ['spiros paramithiotis']
Number of authors left: 12096
xiaoning wang ['xiao-ning wang']
Number of authors left: 11946
hong wei chen ['hongwei chen']
Number of authors left: 11748
bjorn wachsmann ['bjoern wachsmann']
Number of authors left: 11593
kalanithi nesaretam ['kalanithi nesaretnam']
Number of authors left: 11386
yutaka atomi ['yutaka yatomi']
Number of authors left: 11202
haribalaganesh ravinarayanan ['haribalaganesh ravinarayannan']
Number of authors left: 10006
jian-jun zhao ['jianjun zhao']
Number of authors left: 9268
helen neighbour ['helen neighbor']
Number of authors left: 8486
nikolaos papadopoulos ['nikolas papadopoulos']
Number of authors left: 8107
r padmanabhan ['r. padmanabhan']
Number of authors left: 7123
cristina panuzzo ['christina panuzzo']
Number of authors left: 7099
xinsheng yao ['xin-sheng yao']
Number of authors left: 6063
mengqiong shi ['meng-qiong shi']
Number of authors lef
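
The loop above is a greedy, order-dependent grouping: each name that has not yet been absorbed becomes the canonical spelling for every later name scoring at or above the cutoff. A minimal sketch on a handful of names taken from this notebook follows. Two assumptions to note: it uses a plain Indel ratio as a stand-in scorer (`process.extract` uses rapidfuzz's default scorer, `WRatio`, unless told otherwise, so real scores can differ), and it compares against every remaining name instead of keeping only the top few matches. The helper names are ours.

```python
def indel_ratio(a: str, b: str) -> float:
    """Stand-in scorer: 100 * 2 * LCS / (len(a) + len(b))."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], curr[j - 1]))
        prev = curr
    total = len(a) + len(b)
    return 100.0 if total == 0 else 100.0 * 2 * prev[-1] / total


def greedy_name_map(names, cutoff=96.0):
    """Map each absorbed variant to the first-seen spelling it matched."""
    remaining = list(names)
    mapping = {}
    while remaining:
        current = remaining.pop(0)
        matches = [n for n in remaining if indel_ratio(current, n) >= cutoff]
        for variant in matches:
            mapping[variant] = current
        # Absorbed variants are never used as queries themselves
        remaining = [n for n in remaining if n not in matches]
    return mapping


names = ["xiao hui deng", "jianhua zang", "li bin",
         "xiaohui deng", "jianhua zhang", "bin li"]
print(greedy_name_map(names))
```

With this scorer, the spacing variant `xiaohui deng` is folded into `xiao hui deng`, while the reversed pair `li bin` / `bin li` scores far below the cutoff and stays separate, which is the kind of case the manual mapping handles. The pair `jianhua zang` / `jianhua zhang` is also merged, a reminder that a high cutoff can still join names that may belong to different people.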

In [13]:
# Disambiguating names

# Function to map names
def map_names(name):
    if name in name_to_correctName:
        return name_to_correctName[name]
    else:
        return name

# Apply mapping to create new column
df_exploded["AuthorNameDisambiguated"] = df_exploded["AuthorName"].map(lambda x: map_names(x))
df_exploded.head(1)

Unnamed: 0,Record ID,Title,Journal,Author,RetractionDate,RetractionYear,bulkCounts,RetractedInBulk,AuthorName,AuthorNameDisambiguated
63,27832,Enantioselective Organocatalytic Hantzsch Synthesis of Polyhydroquinolines,Organic Letters,Christopher G Evans;Jason E Gestwicki,2014-10-02,2014.0,1.0,False,christopher g evans,christopher g evans


In [14]:
# Now we will rerun the code that identifies authors with multiple retractions
# Note the column change: AuthorNameDisambiguated

# Identifying authors that were retracted multiple times

df_numRetracted = df_exploded.groupby('AuthorNameDisambiguated')['Record ID'].nunique().reset_index()\
                            .rename(columns={'Record ID':'nRetracted'})


df_numRetracted.head(1)

Unnamed: 0,AuthorNameDisambiguated,nRetracted
0,/duane n rafizadeh,1


In [15]:
# Printing authors with multiple retractions
df_numRetracted[df_numRetracted['nRetracted'].gt(1)]
# We see that there are more authors now

Unnamed: 0,AuthorNameDisambiguated,nRetracted
29,a a nesterenko,3
37,a a zafar,2
38,a abdel motelib,2
39,a abdul ajees,2
42,a abou-elela,2
...,...,...
50959,zongxi han,2
50987,zu-hua gao,2
51000,zulkanain abdul rahman,2
51015,zuoren wang,2


In [16]:
from dateutil.relativedelta import relativedelta

# merging with rw

df_rw3 = df_exploded.merge(df_numRetracted, on='AuthorNameDisambiguated')

# Convert 'RetractionDate' to datetime
df_rw3['RetractionDate'] = pd.to_datetime(df_rw3['RetractionDate'])

# Sort the DataFrame by 'AuthorNameDisambiguated' and 'RetractionDate'
df_rw3 = df_rw3.sort_values(by=['AuthorNameDisambiguated', 'RetractionDate'])

# Group by 'AuthorNameDisambiguated' and get the first RetractionDate
df_rw3_firstRetraction = df_rw3.groupby('AuthorNameDisambiguated')['RetractionDate'].min().reset_index()\
                                .rename(columns={'RetractionDate':'FirstRetractionDate'})

df_rw3 = df_rw3.merge(df_rw3_firstRetraction, on='AuthorNameDisambiguated', how='left')

# Computing the difference from the first retraction in whole months
def months_since_first(row):
    delta = relativedelta(row['RetractionDate'], row['FirstRetractionDate'])
    return delta.years * 12 + delta.months

df_rw3['MonthsDiff'] = df_rw3.apply(months_since_first, axis=1)

df_rw3.head(2)

Unnamed: 0,Record ID,Title,Journal,Author,RetractionDate,RetractionYear,bulkCounts,RetractedInBulk,AuthorName,AuthorNameDisambiguated,nRetracted,FirstRetractionDate,MonthsDiff
0,22375,Targeted Protein Internalization and Degradation by ENDosome TArgeting Chimeras (ENDTACs),ACS Central Science,Dhanusha A Nalawansha;Stacey-Lynn Paiva;/duane N Rafizadeh;Mariell Pettersson;Liena Qin;Craig M Crews,2020-01-23,2020.0,1.0,False,/duane n rafizadeh,/duane n rafizadeh,1,2020-01-23,0
1,25367,Factors Influencing the Success of Small and Medium-Sized Businesses,Economy and Entrepreneurship (Ð­ÐšÐžÐÐžÐœÐ˜ÐšÐ Ð˜ ÐŸÐ Ð•Ð”ÐŸÐ Ð˜ÐÐ˜ÐœÐÐ¢Ð•Ð›Ð¬Ð¡Ð¢Ð’Ðž),V (B) Egorichev (Ð•Ð³Ð¾Ñ€Ð¸Ñ‡ÐµÐ²);A (A) Zorina (Ð—Ð¾Ñ€Ð¸Ð½Ð°);P (ÐŸ) Malyarchuk (ÐœÐ°Ð»ÑÑ€Ñ‡ÑƒÐº);A (A) Bochtovaya (Ð‘Ð¾Ñ‡Ñ‚Ð¾Ð²Ð°Ñ);D (Ð”) Teremov (Ð¢ÐµÑ€ÐµÐ¼Ð¾Ð²),2020-07-07,2020.0,1.0,False,a (a) bochtovaya (ð‘ð¾ñ‡ñ‚ð¾ð²ð°ñ),a (a) bochtovaya (ð‘ð¾ñ‡ñ‚ð¾ð²ð°ñ),1,2020-07-07,0
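
`MonthsDiff` counts completed calendar months between an author's first retraction and each of their retractions. A standard-library sketch of the same idea is below. `whole_months_between` is a hypothetical helper, and it may disagree with `relativedelta` when the earlier date falls on a day that the later month does not have (for example the 31st), so treat it as an illustration and not a drop-in replacement.

```python
from datetime import date


def whole_months_between(first: date, later: date) -> int:
    """Completed calendar months from `first` to `later` (assumes later >= first)."""
    months = (later.year - first.year) * 12 + (later.month - first.month)
    # The last month only counts once the day-of-month has been reached
    if later.day < first.day:
        months -= 1
    return months


first_retraction = date(2020, 9, 25)

print(whole_months_between(first_retraction, date(2020, 12, 7)))   # 2
print(whole_months_between(first_retraction, date(2021, 9, 24)))   # 11
print(whole_months_between(first_retraction, date(2021, 9, 25)))   # 12
```

A retraction one day short of the anniversary still counts as 11 months, so it falls on the "less than 12 months" side of the threshold used in the next cells.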


In [17]:
# First, we keep only authors that were retracted multiple times

df_repeated_offenders = df_rw3[df_rw3['nRetracted'].gt(1)]
df_repeated_offenders['Record ID'].nunique(), df_repeated_offenders['AuthorNameDisambiguated'].nunique()

(7512, 6591)

In [18]:
# Now we mark authors with at least one retraction 12 or more months after their first
# All other authors are marked as same-year (within 12 months) repeated offenders

# Extracting authors that were retracted beyond 12 months
authors_ge12months = df_repeated_offenders[df_repeated_offenders['MonthsDiff'].ge(12)]['AuthorNameDisambiguated'].unique()


# Extracting only authors that were repeated offenders in the same year
df_authors_lt12months = df_repeated_offenders[~df_repeated_offenders['AuthorNameDisambiguated'].\
                                            isin(authors_ge12months)].copy()
# Flagging them
df_authors_lt12months['OffenseSameYear'] = True

# Extracting authors that were repeated offenders beyond 12 months
df_authors_ge12months = df_repeated_offenders[df_repeated_offenders['AuthorNameDisambiguated'].\
                                            isin(authors_ge12months)].copy()
# Flagging them
df_authors_ge12months['OffenseSameYear'] = False


# Merging the two

df_repeated_offenders2 = pd.concat([df_authors_lt12months,df_authors_ge12months])

# Extracting first retraction year
df_repeated_offenders2['FirstRetractionYear'] = df_repeated_offenders2['FirstRetractionDate'].dt.year

# At this point we will just remove AuthorName and call AuthorNameDisambiguated as AuthorName
df_repeated_offenders2 = df_repeated_offenders2.drop(columns=['AuthorName'])\
                                    .rename(columns={'AuthorNameDisambiguated':'AuthorName'})

df_repeated_offenders2.head(2)

Unnamed: 0,Record ID,Title,Journal,Author,RetractionDate,RetractionYear,bulkCounts,RetractedInBulk,AuthorName,nRetracted,FirstRetractionDate,MonthsDiff,OffenseSameYear,FirstRetractionYear
29,25214,Properties and Compatibility of Microflora for Creating Starter Cultures in Sausage Production Technology,International Journal of Recent Technology and Engineering,A A Nesterenko;A G Koshchaev;N V Keniiz;R S Omarov;S N Shlykov,2020-09-25,2020.0,1.0,False,a a nesterenko,3,2020-09-25,0,True,2020
30,25101,"Development of Technology for Producing Organic Pork with the Introduce of Probiotics, Prebiotics and Synbiotics into the Diet",International Journal of Innovative Technology and Exploring Engineering,N N Zabashta;A A Nesterenko;A Zabashta;V I Guzenko;E N Chernobai,2020-12-07,2020.0,1.0,False,a a nesterenko,3,2020-09-25,2,True,2020
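
The split-and-concatenate step above assigns one flag per author: `OffenseSameYear` is `True` only when every retraction of that author falls within 12 months of the first one, so despite its name it does not depend on the calendar year. Under that reading, the same flag can be sketched in one step with a group-wise maximum. The toy rows below are invented.

```python
import pandas as pd

# Toy data, invented for illustration: MonthsDiff is months since the
# author's first retraction, as computed earlier in the notebook.
toy = pd.DataFrame({
    "AuthorName": ["author a", "author a", "author a", "author b", "author b"],
    "Record ID": [1, 2, 3, 4, 5],
    "MonthsDiff": [0, 2, 14, 0, 11],
})

# An author is a "same year" repeated offender only if their largest gap is < 12 months
toy["OffenseSameYear"] = toy.groupby("AuthorName")["MonthsDiff"].transform("max").lt(12)

print(toy)
```

Unlike the concatenation of two filtered frames, this variant keeps the original row order.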


In [19]:
df_repeated_offenders2['AuthorName'].nunique()

6591

# Saving

In [20]:
# Finally saving with relevant columns

relevant_cols = ['Record ID', 'AuthorName', 'nRetracted', 'FirstRetractionYear', 'OffenseSameYear']

# Constants
OUTPUT_DIRECTORY = OUTDIR
FILENAME = "RW_repeated_offenders"

file_path = os.path.join(OUTPUT_DIRECTORY, f"{FILENAME}.csv")

# Writing DataFrame to CSV with error handling
try:
    df_repeated_offenders2[relevant_cols].drop_duplicates().to_csv(file_path, index=False)
    print("File saved successfully")
except Exception as e:
    print(f"Error saving file: {e}")


File saved successfully
