# Extracting repeated offenders

In this notebook, we flag authors who have been retracted multiple times. More concretely:

1) We first identify all bulk retractions in Retraction Watch (RW) and flag them.

2) We then extract all author names from RW, along with Record ID, RetractionDate, and RetractionYear.

3) We then split the author names so that author first name, last name, and Record ID are separate columns.

4) We then identify authors who were

    a) Retracted just once

    b) Retracted multiple times when bulk retractions are excluded
    
    c) Retracted multiple times when bulk retractions are included

5) For each author retracted multiple times, we compute the difference in years between their first and second retraction. We compute this difference (i) from the exact retraction dates and (ii) from the retraction years, to allow for different levels of precision.

In [1]:
# Importing relevant packages

import pandas as pd
import os
from config_reader import read_config

In [2]:
# Reading paths
paths = read_config()
RW_ORIGINAL_W_YEAR_LOCAL_PATH = paths['RW_ORIGINAL_W_YEAR_LOCAL_PATH']

In [7]:
# Reading retraction watch
df_rw = pd.read_csv(RW_ORIGINAL_W_YEAR_LOCAL_PATH,
                   usecols=['Record ID', 'Author', 'RetractionDate', 'RetractionYear', 'Title',
                           'Journal'])
df_rw.head(1)

Unnamed: 0,Record ID,Title,Journal,Author,RetractionDate,RetractionYear
0,28599,TWEAK-Fn14 Influences Neurogenesis Status via ...,Molecular Neurobiology,Jing Xu;Jian He;Huang He;Renjun Peng;Jian Xi,2021-05-15,2021.0


## 1. Identifying bulk retractions

In [60]:
# Identify bulk retractions: count distinct records per journal and retraction date
df_rw_bulkCounts = df_rw.groupby(['Journal','RetractionDate'])['Record ID'].nunique().reset_index()\
                            .rename(columns={'Record ID':'bulkCounts'})

# merging with actual RW 
df_rw2 = df_rw.merge(df_rw_bulkCounts, on=['Journal','RetractionDate'], how='left')

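# Flagging records retracted in bulk (5 or more records from the same journal on the same date)
df_rw2['RetractedInBulk'] = df_rw2['bulkCounts'].ge(5)

# Keeping retractions up to 2020 (drops later years and records with a missing year)
df_rw2 = df_rw2[df_rw2['RetractionYear'].le(2020)]

# Removing bulk retractions; .copy() avoids SettingWithCopyWarning in later assignments
df_rw2 = df_rw2[~df_rw2['RetractedInBulk']].copy()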

df_rw2.head(1)

Unnamed: 0,Record ID,Title,Journal,Author,RetractionDate,RetractionYear,bulkCounts,RetractedInBulk
63,27832,Enantioselective Organocatalytic Hantzsch Synthesis of Polyhydroquinolines,Organic Letters,Christopher G Evans;Jason E Gestwicki,2014-10-02,2014.0,1.0,False
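
The bulk flag can also be computed without the intermediate merge by using `groupby(...).transform('nunique')`. The following is a minimal sketch on made-up data, not the notebook's own code: the toy frame, the helper `flag_bulk`, and the constant `BULK_THRESHOLD` are illustrative assumptions.

```python
import pandas as pd

BULK_THRESHOLD = 5  # assumed name; same cut-off as the cell above


def flag_bulk(df, threshold=BULK_THRESHOLD):
    """Return a copy of df with bulkCounts and RetractedInBulk columns added."""
    out = df.copy()
    # Number of distinct records sharing the same journal and retraction date
    out['bulkCounts'] = (
        out.groupby(['Journal', 'RetractionDate'])['Record ID'].transform('nunique')
    )
    out['RetractedInBulk'] = out['bulkCounts'].ge(threshold)
    return out


# Toy data: five records from journal A on one date, plus two isolated records
toy = pd.DataFrame({
    'Record ID': [1, 2, 3, 4, 5, 6, 7],
    'Journal': ['A'] * 5 + ['A', 'B'],
    'RetractionDate': ['2019-01-01'] * 5 + ['2019-02-01', '2019-01-01'],
})
flagged = flag_bulk(toy)
print(flagged[['Record ID', 'bulkCounts', 'RetractedInBulk']])
```

Only the five records that share both journal and date are flagged; the record from journal B on the same date is not.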


## 2. Extracting author names

In [93]:
pd.set_option('display.max_colwidth', None)

# Split the "Author" column by ";" and then explode it to separate rows
df_rw2['AuthorName'] = df_rw2['Author'].str.split(';')
df_exploded = df_rw2.explode('AuthorName')

# Stripping whitespace, then removing empty or missing authors
df_exploded['AuthorName'] = df_exploded['AuthorName'].str.strip()
df_exploded = df_exploded[df_exploded['AuthorName'].ne('') & 
                         ~df_exploded['AuthorName'].isna()]

# sorting to check for anomalies
df_exploded.sort_values(by='AuthorName').reset_index().tail(50)

Unnamed: 0,index,Record ID,Title,Journal,Author,RetractionDate,RetractionYear,bulkCounts,RetractedInBulk,AuthorName
64966,11673,12826,Promising Unconventional Pretreatments for Lignocellulosic Biomass,Critical Reviews in Environmental Science and Technology,Zumar M A Bundhoo;Ackmez Mudhoo;Romeela Mohee,2014-11-04,2014.0,1.0,False,Zumar M A Bundhoo
64967,18437,5997,Multimorbidity - not just an older person's issue. Results from an Australian biomedical study,Social Psychiatry and Psychiatric Epidemiology,Anne W Taylor;Kay Price;Tiffany K Gill;Robert Adams;Rhiannon Pilkington;Natalie Carangis;Zumin Shi;David Wilson,2011-04-01,2011.0,1.0,False,Zumin Shi
64968,1669,24136,Spinal circRNA-9119 Suppresses Nociception by Mediating the miR-26a-TLR3 Axis in a Bone Cancer Pain Mouse Model,Journal of Molecular Neuroscience,Zhongqi Zhang;Xiaoxia Zhang;Yanjing Zhang;Jiyuan Li;Zumin Xing;Yiwen Zhang,2020-09-01,2020.0,1.0,False,Zumin Xing
64969,4859,20244,Influence of Two Common Polymorphisms in the EPHX1 Gene on Warfarin Maintenance Dosage: A Meta-Analysis,BioMed Research International,Hong Qiang Liu;Chang Po Zhang;Chang Zhen Zhang;Xiang Chen Liu;Zun Jing Liu,2019-03-07,2019.0,1.0,False,Zun Jing Liu
64970,5042,19872,Hsa-miR-623 suppresses tumor progression in human lung adenocarcinoma,Cell Death & Disease,Shuang Wei;Zun Yi Zhang;Sheng Ling Fu;Jun Gang Xie;Xian Sheng Liu;Yong Jian Xu;Jian Ping Zhao;Wei Ning Xiong,2019-01-25,2019.0,1.0,False,Zun Yi Zhang
64971,4036,20997,Small Intestinal Tumors: A Rare Case of Tubulovillous Adenoma in Duodenum,Cureus,Mustafa N Malik;Zunairah M Shah;Abdul Rafae;Tayyab Mahmood;Hafiz M Fazeel,2019-08-06,2019.0,2.0,False,Zunairah M Shah
64972,1973,23809,LncRNA MALAT1up-regulates VEGF-A and ANGPT2 to promote angiogenesis in brain microvascular endothelial cells against oxygen-glucose deprivation via targeting miR-145,Bioscience Reports,Lanfen Ren;Chunxia Wei;Kui Li;Zuneng Lu,2020-07-21,2020.0,2.0,False,Zuneng Lu
64973,4718,20240,MicroRNA-31-5p regulates chemosensitivity by preventing the nuclear location of PARP1 in hepatocellular carcinoma,Journal of Experimental & Clinical Cancer Research: (CR),Ke Ting Que;Yun Zhou;Yu You;Zhen Zhang;Ziao Ping Zhao;Jian Ping Gong;Zuo Jin Liu,2019-04-02,2019.0,1.0,False,Zuo Jin Liu
64974,6188,19118,Flow Chart of Methanol in China,Renewable and Sustainable Energy Reviews,Li Wang Su;Xiang Rong Li;Zuo Yu Sun,2018-08-23,2018.0,4.0,False,Zuo Yu Sun
64975,6873,17848,Lentivirus-Mediated Short-Hairpin RNA Targeting Protein Phosphatase 4 Regulatory Subunit 1 Inhibits Growth in Breast Cancer,Journal of Breast Cancer,Yuying Qi;Tinghui Hu;Kai Lin;Renqing Ye;Zuodong Ye,2018-03-31,2018.0,1.0,False,Zuodong Ye
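
Step 3 of the outline (separate first-name and last-name columns) is not carried out in the cells shown here. One possible sketch on made-up data is given below. It assumes the last whitespace-separated token is the last name, which will be wrong for some names (for example, family-name-first orderings such as "He Yuling"); the column names `FirstName` and `LastName` are hypothetical.

```python
import pandas as pd

# Toy data; the trailing "; " mimics an empty author entry
toy = pd.DataFrame({
    'Record ID': [101, 102],
    'Author': ['Jing Xu;Jian He; ', 'Zun Jing Liu'],
})

# One row per author, as in the cell above
authors = (
    toy.assign(AuthorName=toy['Author'].str.split(';'))
       .explode('AuthorName')
       .reset_index(drop=True)
)
authors['AuthorName'] = authors['AuthorName'].str.strip()
authors = authors[authors['AuthorName'].notna() & authors['AuthorName'].ne('')].copy()

# Assumption: the last token is the last name, everything before it the first name
tokens = authors['AuthorName'].str.split()
authors['LastName'] = tokens.str[-1]
authors['FirstName'] = tokens.str[:-1].str.join(' ')
authors = authors.reset_index(drop=True)

print(authors[['Record ID', 'FirstName', 'LastName']])
```

Single-token names end up with an empty `FirstName` under this rule, so they are easy to find and review by hand.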


In [71]:
# Identifying authors that were retracted multiple times

df_numRetracted = df_exploded.groupby('AuthorName')['Record ID'].nunique().reset_index()\
                            .rename(columns={'Record ID':'nRetracted'})

# merging with rw

df_rw3 = df_exploded.merge(df_numRetracted, on='AuthorName')
df_rw3.head(1)

Unnamed: 0,Record ID,Title,Journal,Author,RetractionDate,RetractionYear,bulkCounts,RetractedInBulk,AuthorName,nRetracted
0,27832,Enantioselective Organocatalytic Hantzsch Synthesis of Polyhydroquinolines,Organic Letters,Christopher G Evans;Jason E Gestwicki,2014-10-02,2014.0,1.0,False,Christopher G Evans,1


In [72]:
# Convert 'RetractionDate' to datetime
df_rw3['RetractionDate'] = pd.to_datetime(df_rw3['RetractionDate'])

# Sort the DataFrame by 'AuthorName' and 'RetractionDate'
df_rw3 = df_rw3.sort_values(by=['AuthorName', 'RetractionDate'])

# Group by 'AuthorName' and calculate the gap between consecutive retractions in months (approximated as days / 30)
df_rw3['MonthsDiff'] = df_rw3.groupby('AuthorName')['RetractionDate'].diff().dt.days / 30

df_rw3[df_rw3['nRetracted'].ge(2)].head(2)

Unnamed: 0,Record ID,Title,Journal,Author,RetractionDate,RetractionYear,bulkCounts,RetractedInBulk,AuthorName,nRetracted,MonthsDiff
884,25214,Properties and Compatibility of Microflora for Creating Starter Cultures in Sausage Production Technology,International Journal of Recent Technology and Engineering,A A Nesterenko;A G Koshchaev;N V Keniiz;R S Omarov;S N Shlykov,2020-09-25,2020.0,1.0,False,A A Nesterenko,3,
883,25101,"Development of Technology for Producing Organic Pork with the Introduce of Probiotics, Prebiotics and Synbiotics into the Diet",International Journal of Innovative Technology and Exploring Engineering,N N Zabashta;A A Nesterenko;A Zabashta;V I Guzenko;E N Chernobai,2020-12-07,2020.0,1.0,False,A A Nesterenko,3,2.433333
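
The outline also calls for the gap measured in years, both from exact dates and from calendar years. The sketch below shows one way to compute both on made-up data. The column names `DaysDiff`, `YearsDiffExact`, and `YearsDiffCalendar` are hypothetical, and dividing by 365.25 is an assumed approximation of the year length.

```python
import pandas as pd

# Toy data: author A is retracted twice 21 days apart across a year boundary,
# author B twice exactly two calendar years apart
toy = pd.DataFrame({
    'AuthorName': ['A', 'A', 'B', 'B'],
    'RetractionDate': pd.to_datetime(
        ['2019-12-20', '2020-01-10', '2015-03-01', '2017-03-01']
    ),
})
toy = toy.sort_values(['AuthorName', 'RetractionDate'])

# Date of the same author's previous retraction (NaT for the first one)
prev = toy.groupby('AuthorName')['RetractionDate'].shift()

# (i) Gap from exact dates
toy['DaysDiff'] = (toy['RetractionDate'] - prev).dt.days
toy['YearsDiffExact'] = toy['DaysDiff'] / 365.25  # assumed mean year length

# (ii) Gap from calendar years only
toy['YearsDiffCalendar'] = toy['RetractionDate'].dt.year - prev.dt.year

print(toy)
```

Author A illustrates why the two measures can disagree: the retractions are three weeks apart by date but one year apart by calendar year.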


In [85]:
# Only extracting those that were retracted multiple times
df_repeated_offenders = df_rw3[df_rw3['nRetracted'].ge(2)]

# Keeping each author's second retraction (head(2) then tail(1)); its MonthsDiff is the gap from the first retraction, so we can check which authors were retracted again within a year

df_repOff_top2 = df_repeated_offenders.sort_values(by=['AuthorName','RetractionDate'])\
                                    .groupby('AuthorName').head(2)\
                                    .groupby('AuthorName').tail(1)

df_repOff_top2[df_repOff_top2['MonthsDiff'].lt(12) & df_repOff_top2['RetractionYear'].le(2015) &
                df_repOff_top2['nRetracted'].le(3)]\
                    .sort_values(by='MonthsDiff')

Unnamed: 0,Record ID,Title,Journal,Author,RetractionDate,RetractionYear,bulkCounts,RetractedInBulk,AuthorName,nRetracted,MonthsDiff
44648,3998,"Preventive effect of SA13353 [1-[2-(1-adamantyl)ethyl]-1-pentyl-3-[3-(4-pyridyl)propyl]urea], a novel transient receptor potential vanilloid 1 agonist, on ischemia/reperfusion-induced renal injury in rats",The Journal of Pharmacology and Experimental Therapeutics,Kyoko Ueda;Fumio Tsuji;Tomoko Hirata;Kenji Ueda;Masaaki Murai;Hiroyuki Aono;Masanori Takaoka;Yasuo Matsumura,2014-04-08,2014.0,4.0,False,Kyoko Ueda,3,0.000000
53115,1220,"Treating myocardial stunning randomly, with either propofol or isoflurane following transient coronary occlusion and reperfusion in pigs",Annals of Cardiac Anaesthesia,Felipe Urdaneta;Emilio Lobato;David Kirby;Avner Sidi,2011-01-01,2011.0,1.0,False,Felipe Urdaneta,3,0.000000
54915,1250,The European and English recommendation on the role of biotherapies in moderate to severe psoriasis,Annales de Dermatologie et de Vénéréologie,Herve Bachelez;Maxime Battistella,2011-02-01,2011.0,2.0,False,Maxime Battistella,2,0.000000
35110,4598,Upregulation of Ras/Raf/ERK1/2 signaling and ERK5 in the brain of autistic subjects,"Genes, Brain, and Behavior",K Yang;Ashfaq M Sheikh;Mazhar Malik;Guang Wen;Huachang Zou;W Ted Brown;Xiaohong Li,2013-06-22,2013.0,2.0,False,Mazhar Malik,3,0.000000
62318,18484,Current Status of Exotic Hadrons,AIP Conference Proceedings,M Alam Saeed;Maqsood Ahmed;Fazal e-Aleem,2005-04-20,2005.0,3.0,False,Fazal e-Aleem,3,0.000000
...,...,...,...,...,...,...,...,...,...,...,...
51426,16872,Essential Role of Sphingosine-1-Phosphate Receptor 1-Bearing CD8+CD44+CCR7+ T Cells in Acute Skin Allograft Rejection,American Journal of Transplantation,He Yuling;X Ruijing;J Xiang;Xie Luokun;Y Wenjun;C Feng;H Baojun;Y Hui;Y Guang;Y Chunlei;Z Jixin;C Lang;Q Li;A Chang;B Zhuan;J Youxin;Gong Feili;Tan Jinquan,2011-06-06,2011.0,1.0,False,Xie Luokun,3,11.833333
60212,4557,"Solid Phase Extraction Method for the Determination of Lead, Nickel, Copper and Manganese by Flame Atomic Absorption Spectrometry Using Sodium Bispiperdine-1,1'-carbotetrathioate (Na-BPCTT) in Water Samples",Journal of Hazardous Materials,Dasari Rekha;Kanchi Suvardhan;J Dilip Kumar;P Subramanyam;Puthalapattu Reddy Prasad;Yeramanchi Lingappa;Pattium Chiranjeevi,2008-01-31,2008.0,3.0,False,J Dilip Kumar,3,11.900000
41728,11255,The role of specific PP2A complexes in the dephosphorylation of yH2AX,Journal of Cell Science,Liping Chen;Yandong Lai;Xiaonian Zhu;Lu Ma;Qing Bai;Iria Vazquez;Yongmei Xiao;Caixia Liu;Caochuan Li;Chen Gao;Zhini He;Xiaowen Zeng;Xiumei Xing;Zhengbao Zhang;Jie Li;Bo Zhang;Qing Wang;Anna A Sablina;William C Hahn;Wen Chen,2015-01-14,2015.0,1.0,False,Liping Chen,2,11.900000
60243,4557,"Solid Phase Extraction Method for the Determination of Lead, Nickel, Copper and Manganese by Flame Atomic Absorption Spectrometry Using Sodium Bispiperdine-1,1'-carbotetrathioate (Na-BPCTT) in Water Samples",Journal of Hazardous Materials,Dasari Rekha;Kanchi Suvardhan;J Dilip Kumar;P Subramanyam;Puthalapattu Reddy Prasad;Yeramanchi Lingappa;Pattium Chiranjeevi,2008-01-31,2008.0,3.0,False,P Subramanyam,2,11.900000
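
An alternative way to pick each author's second retraction is to number the rows within each author with `groupby(...).cumcount()` and keep the row numbered 1. This is a sketch on made-up data rather than the notebook's method. Unlike `head(2)` followed by `tail(1)`, it returns nothing for an author with a single row, so it does not depend on having filtered to `nRetracted >= 2` first.

```python
import pandas as pd

# Toy data: A has three retractions, B one, C two
toy = pd.DataFrame({
    'AuthorName': ['A', 'A', 'A', 'B', 'C', 'C'],
    'RetractionDate': pd.to_datetime([
        '2012-05-01', '2010-01-01', '2011-03-01',
        '2015-06-01',
        '2018-02-01', '2018-01-01',
    ]),
})
toy = toy.sort_values(['AuthorName', 'RetractionDate'])

# 0 for an author's earliest retraction, 1 for the second, and so on
order = toy.groupby('AuthorName').cumcount()
second = toy[order.eq(1)]

print(second)
```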


In [87]:
df_repeated_offenders['AuthorName'].nunique()

6407

In [89]:
df_exploded['AuthorName'].nunique()

51434

In [102]:
dftemp = df_repeated_offenders.groupby('AuthorName')['Record ID'].nunique().reset_index()
dftemp[dftemp['Record ID'].ge(80)]

Unnamed: 0,AuthorName,Record ID
6134,Yoshihiro Sato,89
6143,Yoshitaka Fujii,87
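
The counts above exclude bulk retractions, because those rows were dropped earlier in the notebook. Comparing authors with and without bulk retractions (step 4 of the outline) requires keeping the flagged rows. The sketch below assumes an exploded, one-row-per-author frame that still carries `RetractedInBulk`; the toy data and the summary column names are hypothetical.

```python
import pandas as pd

# Toy data: one row per (author, record); record 4 was part of a bulk retraction
toy = pd.DataFrame({
    'AuthorName': ['A', 'A', 'B', 'B', 'C'],
    'Record ID': [1, 2, 3, 4, 5],
    'RetractedInBulk': [False, False, False, True, False],
})

# Distinct retracted records per author, with and without bulk retractions
n_all = toy.groupby('AuthorName')['Record ID'].nunique()
n_nonbulk = toy[~toy['RetractedInBulk']].groupby('AuthorName')['Record ID'].nunique()

# Authors with only bulk retractions get NaN for nNonBulk, hence fillna(0)
summary = pd.DataFrame({'nAll': n_all, 'nNonBulk': n_nonbulk}).fillna(0).astype(int)
summary['RetractedOnce'] = summary['nAll'].eq(1)
summary['RepeatedExclBulk'] = summary['nNonBulk'].ge(2)
summary['RepeatedInclBulk'] = summary['nAll'].ge(2)

print(summary)
```

In the toy data, author B counts as a repeated offender only when bulk retractions are included.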
