# Extracting repeated offenders

In this notebook, we flag authors whose papers have been retracted multiple times. More concretely,

1) We will first identify all bulk retractions in Retraction Watch (RW), and flag them.

2) We will then extract all the author names from RW along with Record ID, RetractionDate, RetractionYear

3) We will then split the author names such that we have Author first name, last name, Record ID as separate columns

4) We will then identify authors that were

    a) Retracted just once

    b) Retracted multiple times when bulk retractions are excluded
    
    c) Retracted multiple times when bulk retractions are included

5) For each author retracted multiple times, we will identify the difference in years between their first and second retraction. We will identify difference in years by (i) exact date, and (ii) by year, to allow for different levels of precision.

In [9]:
# Importing relevant packages

import pandas as pd
import os
from config_reader import read_config
from rapidfuzz import process, fuzz

In [3]:
# Reading paths
paths = read_config()
RW_ORIGINAL_W_YEAR_LOCAL_PATH = paths['RW_ORIGINAL_W_YEAR_LOCAL_PATH']
OUTDIR = paths['PROCESSED_FOLDER_LOCAL']

In [4]:
# Reading retraction watch
df_rw = pd.read_csv(RW_ORIGINAL_W_YEAR_LOCAL_PATH,
                   usecols=['Record ID', 'Author', 'RetractionDate', 'RetractionYear', 'Title',
                           'Journal'])
df_rw.head(1)

Unnamed: 0,Record ID,Title,Journal,Author,RetractionDate,RetractionYear
0,28599,TWEAK-Fn14 Influences Neurogenesis Status via ...,Molecular Neurobiology,Jing Xu;Jian He;Huang He;Renjun Peng;Jian Xi,2021-05-15,2021.0


## 1. Identifying bulk retractions

In [5]:
# Identify bulk retractions: count distinct records per journal and retraction date
df_rw_bulkCounts = df_rw.groupby(['Journal','RetractionDate'])['Record ID'].nunique().reset_index()\
                            .rename(columns={'Record ID':'bulkCounts'})

# merging with actual RW 
df_rw2 = df_rw.merge(df_rw_bulkCounts, on=['Journal','RetractionDate'], how='left')

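# Flagging records to be removed as bulk retractions
# (5 or more records retracted by the same journal on the same date)
df_rw2['RetractedInBulk'] = df_rw2['bulkCounts'].apply(lambda c: c >= 5)

# Keeping only retractions up to and including 2020
# (this drops records from 2021 and records with a missing RetractionYear)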
df_rw2 = df_rw2[df_rw2['RetractionYear'].le(2020)]

# Removing bulk retractions
df_rw2 = df_rw2[~df_rw2['RetractedInBulk']]

df_rw2.head(1)

Unnamed: 0,Record ID,Title,Journal,Author,RetractionDate,RetractionYear,bulkCounts,RetractedInBulk
63,27832,Enantioselective Organocatalytic Hantzsch Synt...,Organic Letters,Christopher G Evans;Jason E Gestwicki,2014-10-02,2014.0,1.0,False


In [6]:
df_rw2['Record ID'].nunique()

14480
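
The bulk-retraction rule used in this section can be sketched on a small example without pandas. The records below are hypothetical and only illustrate the rule as the cell above applies it: a record counts as a bulk retraction when its (journal, retraction date) group contains 5 or more records.

```python
from collections import Counter

# Hypothetical toy records: (record_id, journal, retraction_date)
records = [
    (1, "Journal A", "2015-03-01"),
    (2, "Journal A", "2015-03-01"),
    (3, "Journal A", "2015-03-01"),
    (4, "Journal A", "2015-03-01"),
    (5, "Journal A", "2015-03-01"),
    (6, "Journal A", "2016-07-10"),
    (7, "Journal B", "2015-03-01"),
]

BULK_THRESHOLD = 5  # same cutoff as in the notebook cell

# Count records per (journal, date) pair; record IDs are unique here,
# so counting rows matches counting distinct Record IDs
counts = Counter((journal, day) for _, journal, day in records)

# A record is flagged as bulk if its group reaches the threshold
retracted_in_bulk = {
    record_id: counts[(journal, day)] >= BULK_THRESHOLD
    for record_id, journal, day in records
}

# Records that survive the bulk filter
kept = [record_id for record_id, is_bulk in retracted_in_bulk.items() if not is_bulk]
print(kept)
```

Only records 6 and 7 survive: the five Journal A records from 2015-03-01 form one bulk group, while the other two groups contain a single record each.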

## 2. Extracting author names

In [7]:
pd.set_option('display.max_colwidth', None)

# Split the "Author" column by ";" and then explode it to separate rows
df_rw2['AuthorName'] = df_rw2['Author'].str.split(';')
df_exploded = df_rw2.explode('AuthorName')

# Removing empty authors
df_exploded['AuthorName'] = df_exploded['AuthorName'].str.strip()
df_exploded = df_exploded[df_exploded['AuthorName'].ne('') & 
                         ~df_exploded['AuthorName'].isna()]

# sorting
df_exploded.sort_values(by='AuthorName').reset_index().head(1)

Unnamed: 0,index,Record ID,Title,Journal,Author,RetractionDate,RetractionYear,bulkCounts,RetractedInBulk,AuthorName
0,3091,22375,Targeted Protein Internalization and Degradation by ENDosome TArgeting Chimeras (ENDTACs),ACS Central Science,Dhanusha A Nalawansha;Stacey-Lynn Paiva;/duane N Rafizadeh;Mariell Pettersson;Liena Qin;Craig M Crews,2020-01-23,2020.0,1.0,False,/duane N Rafizadeh


In [34]:
# Identifying authors that were retracted multiple times

df_numRetracted = df_exploded.groupby('AuthorName')['Record ID'].nunique().reset_index()\
                            .rename(columns={'Record ID':'nRetracted'})


df_numRetracted.head(1)

Unnamed: 0,AuthorName,nRetracted
0,/duane N Rafizadeh,1


In [37]:
# Printing authors with multiple retractions
df_numRetracted[df_numRetracted['nRetracted'].gt(1)]

Unnamed: 0,AuthorName,nRetracted
28,A A Nesterenko,3
36,A A Zafar,2
38,A Abdel Motelib,2
39,A Abdul Ajees,2
42,A Abou-Elela,2
...,...,...
51343,Zongxi Han,2
51371,Zu-Hua Gao,2
51384,Zulkanain Abdul Rahman,2
51399,Zuoren Wang,2


# Cleaning author names

In [8]:
# The problem now is that the same author can appear under slightly different spellings.
# For example, "angela d'angelo" and "angela dangelo" are the same person, but RW records
# them with different spellings, so we need to merge these entries.
# We do so by fuzzy matching the RW author names against each other,
# mapping every variant to a single spelling, and then rerunning the code above.
# This normalizes the names and, at the same time, serves as a rough disambiguation.
# The example below shows that fuzzy matching could work well.

fuzz.ratio("angela d'angelo", "angela dangelo")

96.55172413793103
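
The score above can be reproduced by hand. The sketch below assumes that `fuzz.ratio` is the normalized Indel similarity, as the rapidfuzz documentation describes it: 100 × 2 × LCS / (len(a) + len(b)), where LCS is the length of the longest common subsequence. The helper names `lcs_length` and `indel_ratio` are illustrative and not part of rapidfuzz.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """Similarity in [0, 100]: 100 * 2 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100.0 * 2 * lcs_length(a, b) / total


# 15 and 14 characters with an LCS of 14 give 100 * 28 / 29
print(indel_ratio("angela d'angelo", "angela dangelo"))
# 12 and 13 characters with an LCS of 12 give 100 * 24 / 25
print(indel_ratio("Andreas Hinz", "Andreas Heinz"))
```

Under this formula a single inserted character in a name of about 12 characters already lands at the 96 cutoff used later in this notebook, so shorter names need an exact match to be merged.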

In [22]:
authornames = list(df_exploded['AuthorName'].unique())

#authornames.remove('Christopher G Evans')

len(authornames)

51434

In [29]:
# Build a dictionary that maps each name variant to a single canonical spelling.
# For each author, we run a fuzzy match against the remaining names and keep the
# closest variations (process.extract returns at most 5 matches by default).
# After testing cutoffs of 90 and 95, a cutoff of 96 seemed to work best.
# Note: process.extract scores with fuzz.WRatio by default, not the fuzz.ratio
# used in the example above.

authornames = list(df_exploded['AuthorName'].unique())

# For each author name, we identify its matches with score >= 96
name_to_correctName = {}

# We repeat until every author name has gone through fuzzy matching
print(len(authornames))
while len(authornames) > 0:
    
    current_author = authornames[0] # Always index 0, because processed names are removed
    # Remove the current author from authornames
    authornames.remove(current_author)
    # Find all similar author names (score_cutoff keeps scores >= 96)
    choices = process.extract(current_author, authornames, score_cutoff=96)
    # Each match is a (name, score, index) tuple; keep only the names
    choices = [name for (name, score, index) in choices]
    # Add to dictionary
    if len(choices) != 0:
        for choice in choices:
            name_to_correctName[choice] = current_author
        # Update the author name list
        authornames = [name for name in authornames if name not in choices]
        print(current_author, choices)
        print(f"Number of authors left: {len(authornames)}")


51434
Yandong Zhang ['Yadong Zhang']
Number of authors left: 51368
Catriona McLean ['Caitriona McLean']
Number of authors left: 51274
Rafael Arcesio Delgado-Ruiz ['Rafael Arcesio Delgado Ruiz']
Number of authors left: 51154
Manuel Fernandez Dominguez ['Manuel Fernandez-Dominguez']
Number of authors left: 51152
Andreas Hinz ['Andreas Heinz']
Number of authors left: 50678
Zhenlin Zhang ['Zhenling Zhang']
Number of authors left: 50638
Jianhua Zang ['Jianhua Zhang']
Number of authors left: 50599
Guoping Jiang ['Guoping Jian']
Number of authors left: 50590
Xiaoping Zhu ['Xiaoping Zhou']
Number of authors left: 50439
Sagartirtha Sarkar ['Sagatirtha Sarkar']
Number of authors left: 50294
Xiuying Chen ['Xiuying Cheng']
Number of authors left: 50099
Xiufeng Zhang ['Xufeng Zhang']
Number of authors left: 50080
Dong Hyun Kim ['Dong Hun Kim']
Number of authors left: 50046
Chengyong Wang ['Cheng-yong Wang']
Number of authors left: 49991
Shinichi Harad ['Shinichi Harada']
Number of authors left: 499

Jian-tong Jiao ['Jiantong Jiao']
Number of authors left: 30941
Soliman Mahmoud Soliman Abdalla ['Soliman Mahmoud Soliman abdalla']
Number of authors left: 30518
Zhengjun Wang ['Zhenjun Wang']
Number of authors left: 29918
Hongyuan Zhao ['Hongyan Zhao']
Number of authors left: 29826
Arrigo F G Cicero ['Arrigo FG Cicero']
Number of authors left: 29523
Jianming Zhou ['Jinming Zhou']
Number of authors left: 27985
Satoshi Konno ['Satoshi Kono']
Number of authors left: 27243
Shu-Bin Wang ['Shou-Bin Wang']
Number of authors left: 26956
Guo-Qing Chen ['Guo-Qiang Chen']
Number of authors left: 26797
Mickey M Martin ['Mickey M. Martin']
Number of authors left: 26643
Youhoon Chong ['Youhoon Cheong']
Number of authors left: 26557
Tae Young Kim ['Tae Yong Kim']
Number of authors left: 26553
Noman Khandoker ['Norman Khandoker']
Number of authors left: 26530
Jaime A Teixeira Da Silva ['Jaime A Teixeira da Silva']
Number of authors left: 25862
Steven C Huber ['Steven C. Huber']
Number of authors left:
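
The greedy mapping built by the loop above can be illustrated on a handful of names taken from this notebook's output. This is a minimal sketch rather than the notebook's implementation: it uses `difflib.SequenceMatcher` from the standard library as a stand-in scorer (rapidfuzz scores may differ), it does not cap the number of matches per name, and `similarity` and `build_name_map` are hypothetical helper names.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 100]; not the rapidfuzz scorer."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def build_name_map(names, cutoff=96.0):
    """Greedily map each later variant to the first-seen spelling it matches."""
    remaining = list(names)
    name_to_canonical = {}
    while remaining:
        # Take the next unprocessed name as the canonical spelling
        current = remaining.pop(0)
        # Collect every remaining name that is close enough to it
        variants = [n for n in remaining if similarity(current, n) >= cutoff]
        for variant in variants:
            name_to_canonical[variant] = current
        # Matched variants are never used as canonical names themselves
        remaining = [n for n in remaining if n not in variants]
    return name_to_canonical


names = [
    "Steven C Huber",
    "Jian-tong Jiao",
    "Steven C. Huber",
    "Jason E Gestwicki",
    "Jiantong Jiao",
]
mapping = build_name_map(names)
print(mapping)
```

Because the first-seen spelling always wins, the result depends on the order of the input list. Names without a close variant, such as "Jason E Gestwicki" here, never enter the dictionary and keep their original spelling when the mapping is applied.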

In [32]:
# Disambiguating names

# Function to map names
def map_names(name):
    if name in name_to_correctName:
        return name_to_correctName[name]
    else:
        return name

# Apply mapping to create new column
df_exploded["AuthorNameDisambiguated"] = df_exploded["AuthorName"].map(lambda x: map_names(x))
df_exploded.head(1)

Unnamed: 0,Record ID,Title,Journal,Author,RetractionDate,RetractionYear,bulkCounts,RetractedInBulk,AuthorName,AuthorNameDisambiguated
63,27832,Enantioselective Organocatalytic Hantzsch Synthesis of Polyhydroquinolines,Organic Letters,Christopher G Evans;Jason E Gestwicki,2014-10-02,2014.0,1.0,False,Christopher G Evans,Christopher G Evans


In [38]:
# Now we rerun the code that identifies authors with multiple retractions
# Note the column change: we group by AuthorNameDisambiguated instead of AuthorName

# Identifying authors that were retracted multiple times

df_numRetracted = df_exploded.groupby('AuthorNameDisambiguated')['Record ID'].nunique().reset_index()\
                            .rename(columns={'Record ID':'nRetracted'})


df_numRetracted.head(1)

Unnamed: 0,AuthorNameDisambiguated,nRetracted
0,/duane N Rafizadeh,1


In [39]:
# Printing authors with multiple retractions
df_numRetracted[df_numRetracted['nRetracted'].gt(1)]
# We see that there are more authors now

Unnamed: 0,AuthorNameDisambiguated,nRetracted
28,A A Nesterenko,3
36,A A Zafar,2
38,A Abdel Motelib,2
39,A Abdul Ajees,2
42,A Abou-Elela,2
...,...,...
51149,Zongxi Han,2
51177,Zu-Hua Gao,2
51190,Zulkanain Abdul Rahman,2
51205,Zuoren Wang,2


In [40]:
from dateutil.relativedelta import relativedelta

# merging with rw

df_rw3 = df_exploded.merge(df_numRetracted, on='AuthorNameDisambiguated')

# Convert 'RetractionDate' to datetime
df_rw3['RetractionDate'] = pd.to_datetime(df_rw3['RetractionDate'])

# Sort the DataFrame by 'AuthorNameDisambiguated' and 'RetractionDate'
df_rw3 = df_rw3.sort_values(by=['AuthorNameDisambiguated', 'RetractionDate'])

# Group by 'AuthorNameDisambiguated' and get the first RetractionDate
df_rw3_firstRetraction = df_rw3.groupby('AuthorNameDisambiguated')['RetractionDate'].min().reset_index()\
                                .rename(columns={'RetractionDate':'FirstRetractionDate'})

df_rw3 = df_rw3.merge(df_rw3_firstRetraction, on='AuthorNameDisambiguated', how='left')

# Computing the difference from the first retraction, in whole months
def months_since_first(row):
    # Build the relativedelta once and convert years + months into months
    delta = relativedelta(row['RetractionDate'], row['FirstRetractionDate'])
    return delta.years * 12 + delta.months

df_rw3['MonthsDiff'] = df_rw3.apply(months_since_first, axis=1)

df_rw3.head(2)

Unnamed: 0,Record ID,Title,Journal,Author,RetractionDate,RetractionYear,bulkCounts,RetractedInBulk,AuthorName,AuthorNameDisambiguated,nRetracted,FirstRetractionDate,MonthsDiff
0,22375,Targeted Protein Internalization and Degradation by ENDosome TArgeting Chimeras (ENDTACs),ACS Central Science,Dhanusha A Nalawansha;Stacey-Lynn Paiva;/duane N Rafizadeh;Mariell Pettersson;Liena Qin;Craig M Crews,2020-01-23,2020.0,1.0,False,/duane N Rafizadeh,/duane N Rafizadeh,1,2020-01-23,0
1,25367,Factors Influencing the Success of Small and Medium-Sized Businesses,Economy and Entrepreneurship (ЭКОНОМИКА И ПРЕДПРИНИМАТЕЛЬСТВО),V (B) Egorichev (Егоричев);A (A) Zorina (Зорина);P (П) Malyarchuk (Малярчук);A (A) Bochtovaya (Бочтовая);D (Д) Teremov (Теремов),2020-07-07,2020.0,1.0,False,A (A) Bochtovaya (Бочтовая),A (A) Bochtovaya (Бочтовая),1,2020-07-07,0
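
To make the meaning of `MonthsDiff` concrete, the sketch below counts whole months between two dates using only the standard library. It is an approximation of the `relativedelta` computation above and can differ from it around month ends (for example, 31 January to 29 February). The helper name `whole_months_between` is hypothetical, and it assumes the later date is not before the earlier one.

```python
from datetime import date


def whole_months_between(later: date, earlier: date) -> int:
    """Completed calendar months from `earlier` to `later` (later >= earlier)."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    # The last month is incomplete if the day of month has not been reached
    if later.day < earlier.day:
        months -= 1
    return months


first = date(2020, 9, 25)
print(whole_months_between(date(2020, 12, 7), first))   # second retraction of A A Nesterenko
print(whole_months_between(date(2021, 9, 24), first))   # one day short of a full year
print(whole_months_between(date(2021, 9, 25), first))   # exactly one year later
```

The first call reproduces the value 2 that this notebook reports for the retraction dated 2020-12-07 of an author first retracted on 2020-09-25. The other two calls show where the 12-month boundary falls.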


In [41]:
# Now we need to first only extract authors retracted multiple times

df_repeated_offenders = df_rw3[df_rw3['nRetracted'].gt(1)]
df_repeated_offenders['Record ID'].nunique(), df_repeated_offenders['AuthorNameDisambiguated'].nunique()

(7456, 6522)

In [45]:
# Now we mark authors that have at least one retraction 12 or more months after their first
# All other authors are marked as same-year offenders (all retractions within 12 months)

# Extracting authors that were retracted again 12 or more months after their first retraction
authors_ge12months = df_repeated_offenders[df_repeated_offenders['MonthsDiff'].ge(12)]['AuthorNameDisambiguated'].unique()


# Extracting only authors that were repeated offenders in the same year
df_authors_lt12months = df_repeated_offenders[~df_repeated_offenders['AuthorNameDisambiguated'].\
                                            isin(authors_ge12months)].copy()
# Flagging them
df_authors_lt12months['OffenseSameYear'] = True

# Extracting authors that were repeated offenders beyond 12 months
df_authors_ge12months = df_repeated_offenders[df_repeated_offenders['AuthorNameDisambiguated'].\
                                            isin(authors_ge12months)].copy()
# Flagging them
df_authors_ge12months['OffenseSameYear'] = False


# Merging the two

df_repeated_offenders2 = pd.concat([df_authors_lt12months,df_authors_ge12months])

# Extracting first retraction year
df_repeated_offenders2['FirstRetractionYear'] = df_repeated_offenders2['FirstRetractionDate'].dt.year

# At this point we will just remove AuthorName and call AuthorNameDisambiguated as AuthorName
df_repeated_offenders2 = df_repeated_offenders2.drop(columns=['AuthorName'])\
                                    .rename(columns={'AuthorNameDisambiguated':'AuthorName'})

df_repeated_offenders2.head(2)

Unnamed: 0,Record ID,Title,Journal,Author,RetractionDate,RetractionYear,bulkCounts,RetractedInBulk,AuthorName,nRetracted,FirstRetractionDate,MonthsDiff,OffenseSameYear,FirstRetractionYear
28,25214,Properties and Compatibility of Microflora for Creating Starter Cultures in Sausage Production Technology,International Journal of Recent Technology and Engineering,A A Nesterenko;A G Koshchaev;N V Keniiz;R S Omarov;S N Shlykov,2020-09-25,2020.0,1.0,False,A A Nesterenko,3,2020-09-25,0,True,2020
29,25101,"Development of Technology for Producing Organic Pork with the Introduce of Probiotics, Prebiotics and Synbiotics into the Diet",International Journal of Innovative Technology and Exploring Engineering,N N Zabashta;A A Nesterenko;A Zabashta;V I Guzenko;E N Chernobai,2020-12-07,2020.0,1.0,False,A A Nesterenko,3,2020-09-25,2,True,2020
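
The `OffenseSameYear` flag is assigned per author rather than per record, which a small example makes easier to see. The rows below are hypothetical (author, months since first retraction) pairs, and the variable names are illustrative only.

```python
# Hypothetical (author, MonthsDiff) pairs, one per retracted record
rows = [
    ("Author A", 0),
    ("Author A", 2),
    ("Author B", 0),
    ("Author B", 3),
    ("Author B", 14),
]

# Authors with at least one retraction 12 or more months after their first
authors_ge12months = {author for author, months in rows if months >= 12}

# Every record of an author gets the same flag
offense_same_year = {author: author not in authors_ge12months for author, _ in rows}
print(offense_same_year)
```

Author B is flagged `False` on all three records, including the one at 3 months, because a single later retraction is enough. "Same year" here means within 12 months of the first retraction, not within the same calendar year.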


In [48]:
df_repeated_offenders2['AuthorName'].nunique()

6522

# Saving

In [46]:
# Finally saving with relevant columns

relevant_cols = ['Record ID', 'AuthorName', 'nRetracted', 'FirstRetractionYear', 'OffenseSameYear']

# Constants
OUTPUT_DIRECTORY = OUTDIR
FILENAME = "RW_repeated_offenders"

file_path = os.path.join(OUTPUT_DIRECTORY, f"{FILENAME}.csv")

# Writing DataFrame to CSV with error handling
try:
    df_repeated_offenders2[relevant_cols].drop_duplicates().to_csv(file_path, index=False)
    print("File saved successfully")
except Exception as e:
    print(f"Error saving file: {e}")


File saved successfully
