# Extracting repeated offenders

In this notebook, we flag authors whose papers have been retracted multiple times. More concretely:

1) We will first identify all bulk retractions in Retraction Watch (RW) and flag them.

2) We will then extract all author names from RW, along with Record ID, RetractionDate, and RetractionYear.

3) We will then split the author names so that author first name, last name, and Record ID are separate columns.

4) We will then identify authors that were

    a) Retracted just once

    b) Retracted multiple times when bulk retractions are excluded

    c) Retracted multiple times when bulk retractions are included

5) For each author retracted multiple times, we will compute the time elapsed between their first and second retraction. We will measure this gap (i) by exact date and (ii) by calendar year, to allow for different levels of precision.
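
To illustrate why step 5 distinguishes the two levels of precision, here is a minimal sketch with made-up dates (not taken from RW). Two retractions that are less than a month apart can still fall in different calendar years, so the year-based gap and the date-based gap disagree.

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

# Hypothetical dates for one author's first and second retraction
first = pd.Timestamp("2019-12-15")
second = pd.Timestamp("2020-01-10")

# (i) Exact-date precision: whole months elapsed between the two dates
delta = relativedelta(second, first)
months_apart = delta.years * 12 + delta.months

# (ii) Year precision: difference between the calendar years only
years_apart = second.year - first.year

print(months_apart, years_apart)
```

Here the exact-date gap is under one month, while the year-based gap is one year.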

In [1]:
# Importing relevant packages

import pandas as pd
import os
from config_reader import read_config

In [2]:
# Reading paths
paths = read_config()
RW_ORIGINAL_W_YEAR_LOCAL_PATH = paths['RW_ORIGINAL_W_YEAR_LOCAL_PATH']
OUTDIR = paths['PROCESSED_FOLDER_LOCAL']

In [3]:
# Reading retraction watch
df_rw = pd.read_csv(RW_ORIGINAL_W_YEAR_LOCAL_PATH,
                   usecols=['Record ID', 'Author', 'RetractionDate', 'RetractionYear', 'Title',
                           'Journal'])
df_rw.head(1)

Unnamed: 0,Record ID,Title,Journal,Author,RetractionDate,RetractionYear
0,28599,TWEAK-Fn14 Influences Neurogenesis Status via ...,Molecular Neurobiology,Jing Xu;Jian He;Huang He;Renjun Peng;Jian Xi,2021-05-15,2021.0


## 1. Identifying bulk retractions

In [4]:
# Identify bulk retractions: count distinct records per (journal, retraction date)
df_rw_bulkCounts = df_rw.groupby(['Journal','RetractionDate'])['Record ID'].nunique().reset_index()\
                            .rename(columns={'Record ID':'bulkCounts'})

# Merging the counts back onto the full RW table
df_rw2 = df_rw.merge(df_rw_bulkCounts, on=['Journal','RetractionDate'], how='left')

# Flagging records as bulk retractions: 5 or more records from the same journal on the same date
df_rw2['RetractedInBulk'] = df_rw2['bulkCounts'].ge(5)

# Keeping only records retracted in 2020 or earlier (this also drops rows with a missing RetractionYear)
df_rw2 = df_rw2[df_rw2['RetractionYear'].le(2020)]

# Removing bulk retractions
df_rw2 = df_rw2[~df_rw2['RetractedInBulk']]

df_rw2.head(1)

Unnamed: 0,Record ID,Title,Journal,Author,RetractionDate,RetractionYear,bulkCounts,RetractedInBulk
63,27832,Enantioselective Organocatalytic Hantzsch Synt...,Organic Letters,Christopher G Evans;Jason E Gestwicki,2014-10-02,2014.0,1.0,False


In [5]:
df_rw2['Record ID'].nunique()

14480
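
As a sanity check on the bulk-retraction rule, the following sketch applies the same threshold of five records per journal and date to a small made-up table. The data and the `toy` name are assumptions for illustration. It uses `groupby(...).transform("nunique")` in place of the groupby-then-merge used above, which is a possible shortcut rather than what the notebook does.

```python
import pandas as pd

# Toy data (hypothetical): journal A retracts five records on one day,
# journal B retracts a single record on the same day
toy = pd.DataFrame({
    "Record ID": [1, 2, 3, 4, 5, 6],
    "Journal": ["A"] * 5 + ["B"],
    "RetractionDate": ["2019-03-01"] * 6,
})

# Number of distinct records per (journal, date) pair, broadcast back to each row
toy["bulkCounts"] = (
    toy.groupby(["Journal", "RetractionDate"])["Record ID"].transform("nunique")
)

# Same threshold as the cell above: five or more records count as a bulk retraction
toy["RetractedInBulk"] = toy["bulkCounts"].ge(5)

print(toy[["Record ID", "bulkCounts", "RetractedInBulk"]])
```

Only the five records from journal A are flagged; the single record from journal B is kept.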

## 2. Extracting author names

In [6]:
pd.set_option('display.max_colwidth', None)

# Split the "Author" column by ";" and then explode it to separate rows
df_rw2['AuthorName'] = df_rw2['Author'].str.split(';')
df_exploded = df_rw2.explode('AuthorName')

# Removing empty authors
df_exploded['AuthorName'] = df_exploded['AuthorName'].str.strip()
df_exploded = df_exploded[df_exploded['AuthorName'].ne('') & 
                         ~df_exploded['AuthorName'].isna()]

# sorting
df_exploded.sort_values(by='AuthorName').reset_index().head(1)

Unnamed: 0,index,Record ID,Title,Journal,Author,RetractionDate,RetractionYear,bulkCounts,RetractedInBulk,AuthorName
0,3091,22375,Targeted Protein Internalization and Degradation by ENDosome TArgeting Chimeras (ENDTACs),ACS Central Science,Dhanusha A Nalawansha;Stacey-Lynn Paiva;/duane N Rafizadeh;Mariell Pettersson;Liena Qin;Craig M Crews,2020-01-23,2020.0,1.0,False,/duane N Rafizadeh
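
The split, explode, and clean-up steps can be seen more easily on a tiny made-up table. The author strings below are hypothetical and include a trailing `;` and stray spaces, which is the kind of input the strip and empty-string filter are meant to handle.

```python
import pandas as pd

# Toy data (hypothetical): note the trailing ';' and the space after the first ';'
toy = pd.DataFrame({
    "Record ID": [101, 102],
    "Author": ["Ann Lee; Bo Chen;", "Bo Chen"],
})

# One row per (record, author) pair
authors = (
    toy.assign(AuthorName=toy["Author"].str.split(";"))
       .explode("AuthorName")
       .reset_index(drop=True)
)

# Trim whitespace, then drop empty or missing names
authors["AuthorName"] = authors["AuthorName"].str.strip()
authors = authors[authors["AuthorName"].ne("") & authors["AuthorName"].notna()]

print(authors[["Record ID", "AuthorName"]])
```

The trailing `;` produces an empty string after splitting, which the filter removes, so record 101 ends up with two author rows and record 102 with one.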


In [7]:
# Counting the number of distinct retracted records per author

df_numRetracted = df_exploded.groupby('AuthorName')['Record ID'].nunique().reset_index()\
                            .rename(columns={'Record ID':'nRetracted'})


df_numRetracted.head(1)

Unnamed: 0,AuthorName,nRetracted
0,/duane N Rafizadeh,1


In [8]:
from dateutil.relativedelta import relativedelta

# merging with rw

df_rw3 = df_exploded.merge(df_numRetracted, on='AuthorName')

# Convert 'RetractionDate' to datetime
df_rw3['RetractionDate'] = pd.to_datetime(df_rw3['RetractionDate'])

# Sort the DataFrame by 'AuthorName' and 'RetractionDate'
df_rw3 = df_rw3.sort_values(by=['AuthorName', 'RetractionDate'])

# Group by 'AuthorName' and get the first RetractionDate
df_rw3_firstRetraction = df_rw3.groupby('AuthorName')['RetractionDate'].min().reset_index()\
                                .rename(columns={'RetractionDate':'FirstRetractionDate'})

df_rw3 = df_rw3.merge(df_rw3_firstRetraction, on='AuthorName', how='left')

# Computing the number of whole months between each retraction and the author's first retraction
def months_between(later, earlier):
    delta = relativedelta(later, earlier)
    return delta.years * 12 + delta.months

df_rw3['MonthsDiff'] = df_rw3.apply(lambda row: months_between(row['RetractionDate'],
                            row['FirstRetractionDate']), axis=1)

df_rw3.head(2)

Unnamed: 0,Record ID,Title,Journal,Author,RetractionDate,RetractionYear,bulkCounts,RetractedInBulk,AuthorName,nRetracted,FirstRetractionDate,MonthsDiff
0,22375,Targeted Protein Internalization and Degradation by ENDosome TArgeting Chimeras (ENDTACs),ACS Central Science,Dhanusha A Nalawansha;Stacey-Lynn Paiva;/duane N Rafizadeh;Mariell Pettersson;Liena Qin;Craig M Crews,2020-01-23,2020.0,1.0,False,/duane N Rafizadeh,1,2020-01-23,0
1,25367,Factors Influencing the Success of Small and Medium-Sized Businesses,Economy and Entrepreneurship (ЭКОНОМИКА И ПРЕДПРИНИМАТЕЛЬСТВО),V (B) Egorichev (Егоричев);A (A) Zorina (Зорина);P (П) Malyarchuk (Малярчук);A (A) Bochtovaya (Бочтовая);D (Д) Teremov (Теремов),2020-07-07,2020.0,1.0,False,A (A) Bochtovaya (Бочтовая),1,2020-07-07,0
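
The first-retraction and month-difference logic can be checked on a small made-up example. The sketch below assumes hypothetical authors and dates, and uses `groupby(...).transform("min")` to attach each author's earliest date directly, as one possible alternative to the separate groupby and merge above. The helper name `months_between` is also an assumption for illustration.

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

# Toy data (hypothetical authors and dates)
toy = pd.DataFrame({
    "AuthorName": ["Ann Lee", "Ann Lee", "Bo Chen"],
    "RetractionDate": pd.to_datetime(["2018-05-20", "2019-07-01", "2020-02-02"]),
})

# Earliest retraction per author, aligned to every row of that author
toy["FirstRetractionDate"] = (
    toy.groupby("AuthorName")["RetractionDate"].transform("min")
)

def months_between(later, earlier):
    """Whole months from `earlier` to `later` (hypothetical helper)."""
    delta = relativedelta(later, earlier)
    return delta.years * 12 + delta.months

toy["MonthsDiff"] = [
    months_between(later, earlier)
    for later, earlier in zip(toy["RetractionDate"], toy["FirstRetractionDate"])
]

print(toy)
```

Because `relativedelta` counts completed months, the gap from 2018-05-20 to 2019-07-01 is 13 months rather than 14: the fourteenth month would only complete on 2019-07-20.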


In [9]:
# Keeping only authors with more than one retracted record

df_repeated_offenders = df_rw3[df_rw3['nRetracted'].gt(1)]
df_repeated_offenders['Record ID'].nunique(), df_repeated_offenders['AuthorName'].nunique()

(7370, 6407)

In [10]:
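# Authors with any retraction 12 or more months after their first are flagged OffenseSameYear = False
# All other repeat offenders (every retraction within 12 months of the first) are flagged True

# Extracting authors that were retracted again 12 or more months after their first retraction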
authors_ge12months = df_repeated_offenders[df_repeated_offenders['MonthsDiff'].ge(12)]['AuthorName'].unique()


# Extracting only authors that were repeated offenders in the same year
df_authors_lt12months = df_repeated_offenders[~df_repeated_offenders['AuthorName'].\
                                            isin(authors_ge12months)].copy()
# Flagging them
df_authors_lt12months['OffenseSameYear'] = True

# Extracting authors that were repeated offenders beyond 12 months
df_authors_ge12months = df_repeated_offenders[df_repeated_offenders['AuthorName'].\
                                            isin(authors_ge12months)].copy()
# Flagging them
df_authors_ge12months['OffenseSameYear'] = False


# Merging the two

df_repeated_offenders2 = pd.concat([df_authors_lt12months,df_authors_ge12months])

# Extracting first retraction year
df_repeated_offenders2['FirstRetractionYear'] = df_repeated_offenders2['FirstRetractionDate'].dt.year


df_repeated_offenders2.head(2)

Unnamed: 0,Record ID,Title,Journal,Author,RetractionDate,RetractionYear,bulkCounts,RetractedInBulk,AuthorName,nRetracted,FirstRetractionDate,MonthsDiff,OffenseSameYear,FirstRetractionYear
28,25214,Properties and Compatibility of Microflora for Creating Starter Cultures in Sausage Production Technology,International Journal of Recent Technology and Engineering,A A Nesterenko;A G Koshchaev;N V Keniiz;R S Omarov;S N Shlykov,2020-09-25,2020.0,1.0,False,A A Nesterenko,3,2020-09-25,0,True,2020
29,25101,"Development of Technology for Producing Organic Pork with the Introduce of Probiotics, Prebiotics and Synbiotics into the Diet",International Journal of Innovative Technology and Exploring Engineering,N N Zabashta;A A Nesterenko;A Zabashta;V I Guzenko;E N Chernobai,2020-12-07,2020.0,1.0,False,A A Nesterenko,3,2020-09-25,2,True,2020
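
The `OffenseSameYear` flag is an author-level property: it is `True` only when all of an author's retractions fall within 12 months of the first. A compact way to express the same rule, sketched here on made-up data, is to take each author's largest `MonthsDiff` and compare it with 12. This is an illustrative alternative to the split-and-concat approach above, not the notebook's own code.

```python
import pandas as pd

# Toy data (hypothetical): Ann Lee has a retraction 13 months after her first,
# Bo Chen's retractions are all within 2 months of his first
toy = pd.DataFrame({
    "AuthorName": ["Ann Lee", "Ann Lee", "Bo Chen", "Bo Chen"],
    "MonthsDiff": [0, 13, 0, 2],
})

# An author is a same-year offender if their largest gap is under 12 months
toy["OffenseSameYear"] = (
    toy.groupby("AuthorName")["MonthsDiff"].transform("max").lt(12)
)

print(toy)
```

Every row of an author receives the same flag, which matches the behaviour of filtering with `isin(authors_ge12months)`.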


## 3. Saving

In [11]:
# Finally saving with relevant columns

relevant_cols = ['Record ID', 'AuthorName', 'nRetracted', 'FirstRetractionYear', 'OffenseSameYear']

# Constants
OUTPUT_DIRECTORY = OUTDIR
FILENAME = "RW_repeated_offenders"

file_path = os.path.join(OUTPUT_DIRECTORY, f"{FILENAME}.csv")

# Writing DataFrame to CSV with error handling
try:
    df_repeated_offenders2[relevant_cols].drop_duplicates().to_csv(file_path, index=False)
    print("File saved successfully")
except Exception as e:
    print(f"Error saving file: {e}")


File saved successfully
