# Preprocessing Reasons

Update (12/3/24): This code is an adaptation of the code written before the review round on reason propagation. We will use it to propagate reasons for our new filtered sample.

- We will first read the manually annotated reasons and then extract the majority reason and majority retractor.
- We will then read our new matched papers sample and propagate reasons for them. 
- Finally we will compute stats.

--------

In this notebook, we shall preprocess reasons from rounds 1 and 2 and combine them with round X. We will then use label propagation to label reasons for the entire dataset. We will also evaluate how well the reason label propagation works.

The steps are as follows:

1. We will first read the relevant files.
2. Then we will merge the different reasons files to obtain the majority reason and majority retractor for each paper.
3. We will then merge the reasons with the original **RW_papers**, and then propagate reasons to all records that have no reason. In this process, we will hold out 100 papers and use them to validate our approach.
4. Finally, we will compute stats on **filtered_sample**, on **RW_papers**, and on **matched_sample**.

In [32]:
# Importing relevant packages

import pandas as pd
import os
from config_reader import read_config

In [60]:
# Reading paths
paths = read_config()
# Path to the newly annotated reasons (round X)
REASONS_ANNOTATED_ROUNDX_PATH = paths['REASONS_ANNOTATED_ROUNDX_PATH']
# Path to the old reasons with their authors
AUTHORS_W_REASONS_OLD_PATH = paths['AUTHORS_W_REASONS_OLD_PATH']
# Path to our processed MAG-RW paper matches file
PROCESSED_RW_MAG_FINAL_PAPER_MATCHES_LOCAL_PATH = paths['PROCESSED_RW_MAG_FINAL_PAPER_MATCHES_LOCAL_PATH']
# Path to the original RW papers 
RW_ORIGINAL_W_YEAR_LOCAL_PATH = paths['RW_ORIGINAL_W_YEAR_LOCAL_PATH']
# Path to where we will save our processed files
OUTDIR = paths['PROCESSED_FOLDER_LOCAL']

In [61]:
# Loading the file that contains new reasons
df_reasonsX = pd.read_csv(REASONS_ANNOTATED_ROUNDX_PATH)\
                    .rename(columns={'Final Reason': 'ReasonMajority',
                                   'Final Retractor': 'RetractorMajority'})\
                    .drop(columns='MAGPID')\
                    .drop_duplicates()
df_reasonsX

Unnamed: 0,Record ID,ReasonMajority,Reason,RetractorMajority
0,264,H,+Concerns/Issues About Data;+Error in Data;,A
1,455,H,+Error in Analyses;+Error in Methods;+Error in...,A
2,456,H,+Error in Methods;+Results Not Reproducible;,A
3,460,H,+Contamination of Reagents;+Results Not Reprod...,A
4,461,H,+Contamination of Reagents;+Results Not Reprod...,A
...,...,...,...,...
104,17292,U,+Notice - Limited or No Information;,U
105,17303,U,+Notice - Limited or No Information;,U
106,17950,U,+Date of Retraction/Other Unknown;+Notice - La...,U
107,19442,U,+Notice - Unable to Access via current resources;,U


In [62]:
df_reasonsOld = pd.read_csv(AUTHORS_W_REASONS_OLD_PATH, 
                    usecols=['Record ID','reason_concordance',
                            'retractor_concordance']).drop_duplicates().\
                    rename(columns={'reason_concordance':'ReasonMajority',
                                   'retractor_concordance':'RetractorMajority'})

df_reasonsOld

Unnamed: 0,Record ID,ReasonMajority,RetractorMajority
0,455,,
1,3396,H,A
4,3040,H,A
5,2200,H,A
7,2042,H,A
...,...,...,...
2082,5100,P,E
2084,9400,P,E
2085,16501,P,E
2086,8725,M,A


In [63]:
# Let us first merge the two reasons files

df_reasonsOld_woX = df_reasonsOld[~df_reasonsOld['Record ID'].isin(df_reasonsX['Record ID'])]
df_reasonsOld_woX

Unnamed: 0,Record ID,ReasonMajority,RetractorMajority
1,3396,H,A
4,3040,H,A
5,2200,H,A
7,2042,H,A
8,197,H,A
...,...,...,...
2082,5100,P,E
2084,9400,P,E
2085,16501,P,E
2086,8725,M,A


In [64]:
df_reasons = pd.concat([df_reasonsOld_woX, df_reasonsX]).drop(columns=['Reason'])
df_reasons

Unnamed: 0,Record ID,ReasonMajority,RetractorMajority
1,3396,H,A
4,3040,H,A
5,2200,H,A
7,2042,H,A
8,197,H,A
...,...,...,...
104,17292,U,U
105,17303,U,U
106,17950,U,U
107,19442,U,U
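
The two cells above give the round X annotations priority over the older ones: old rows whose `Record ID` also appears in round X are dropped before the two frames are stacked. A minimal sketch of that pattern on toy data follows; the frames `old` and `new` and their values are hypothetical and only illustrate the mechanics.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the real annotation files)
old = pd.DataFrame({"Record ID": [1, 2, 3],
                    "ReasonMajority": ["H", "P", "M"]})
new = pd.DataFrame({"Record ID": [2, 4],
                    "ReasonMajority": ["O", "U"]})

# Keep only the old rows whose Record ID is absent from the newer round
old_wo_new = old[~old["Record ID"].isin(new["Record ID"])]

# Stack the two frames; Record ID 2 now carries only the newer label "O"
combined = pd.concat([old_wo_new, new], ignore_index=True)
print(combined)
```

Because the overlap is removed before concatenation, each `Record ID` appears once in the combined frame, provided neither input contains duplicates on its own.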


In [65]:
df_reasons.RetractorMajority.value_counts()

RetractorMajority
A            580
E            506
U            106
C             30
ambiguous     28
Name: count, dtype: int64

In [66]:
df_reasons.ReasonMajority.value_counts()

ReasonMajority
H            347
P            311
M            251
O            170
U            121
ambiguous     25
H/O            8
M/P            8
O/P            3
H/M            2
H/P            2
M/O            2
Name: count, dtype: int64

In [67]:
# Let us map reasons to proper names
reason_mapping = {'H': 'mistake',
                  'P': 'plagiarism',
                  'M': 'misconduct',
                  'O': 'other',
                  'U': 'unknown',
                  'M/P': 'ambiguous',
                  'H/O': 'ambiguous',
                  'M/O': 'ambiguous',
                  'H/P': 'ambiguous',
                  'H/M': 'ambiguous',
                  'O/P': 'ambiguous'}

retractor_mapping = {'A': 'author',
                    'E': 'journal',
                    'U': 'unknown',
                    'C': 'author'}

df_reasons['RetractorMajority'] = df_reasons['RetractorMajority'].replace(retractor_mapping)
df_reasons['ReasonMajority'] = df_reasons['ReasonMajority'].replace(reason_mapping)
df_reasons

Unnamed: 0,Record ID,ReasonMajority,RetractorMajority
1,3396,mistake,author
4,3040,mistake,author
5,2200,mistake,author
7,2042,mistake,author
8,197,mistake,author
...,...,...,...
104,17292,unknown,unknown
105,17303,unknown,unknown
106,17950,unknown,unknown
107,19442,unknown,unknown


In [68]:
df_reasons.RetractorMajority.value_counts()

RetractorMajority
author       610
journal      506
unknown      106
ambiguous     28
Name: count, dtype: int64

In [69]:
df_reasons.ReasonMajority.value_counts()

ReasonMajority
mistake       347
plagiarism    311
misconduct    251
other         170
unknown       121
ambiguous      50
Name: count, dtype: int64
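
Note that `ambiguous` survives in `RetractorMajority` even though `retractor_mapping` has no entry for it. This is consistent with how `Series.replace` treats values missing from the dictionary: it leaves them unchanged, whereas `Series.map` would turn them into `NaN`. A small sketch on toy values (the series `codes` is hypothetical):

```python
import pandas as pd

codes = pd.Series(["H", "M/P", "ambiguous"])
mapping = {"H": "mistake", "M/P": "ambiguous"}

# replace: values without a dictionary entry pass through unchanged
replaced = codes.replace(mapping)

# map: values without a dictionary entry become NaN
mapped = codes.map(mapping)

print(replaced.tolist())
print(mapped.tolist())
```

Using `replace` is therefore the safer choice here, since labels that are already spelled out are preserved.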

In [70]:
df_reasons['Record ID'].nunique()

1250

In [72]:
# Reading the RW papers merged with MAG (this isn't really our filtered sample for post review iteration)
df_filtered_sample = pd.read_csv(PROCESSED_RW_MAG_FINAL_PAPER_MATCHES_LOCAL_PATH)

df_filtered_sample.columns

Index(['Record ID', 'MAGPID', 'RWTitleNorm', 'MAGTitle',
       'RecordMatchingMethod', 'FuzzyScore', 'RWPubYear', 'MAGPubYear',
       'RetractionYear', 'RecordMatchingMethodStep2'],
      dtype='object')

In [73]:
#df_matched_sample = pd.read_csv(indir+"/RWMatched_wClosestMatch.csv")

In [75]:
df_filtered_sample[df_filtered_sample['Record ID'].isin(df_reasons['Record ID'].unique())]['Record ID'].nunique()

1227

In [76]:
#df_matched_sample[df_matched_sample.MAGPID.isin(df_reasons.MAGPID.unique())].MAGPID.nunique()

## Label Propagation

In [78]:
# Let us now propagate reasons

# Loading all RW papers with Reasons given by RW

df_papers = pd.read_csv(RW_ORIGINAL_W_YEAR_LOCAL_PATH, usecols=['Record ID','Reason']).\
            drop_duplicates()
df_papers.head()

Unnamed: 0,Record ID,Reason
0,28599,+Duplication of Image;+Unreliable Data;
1,28504,+Error in Analyses;+Error in Data;+Error in Me...
2,28506,+Error in Data;+Error in Methods;+Unreliable R...
3,28505,+Concerns/Issues About Data;+Duplication of Im...
4,28502,+Plagiarism of Article;


In [79]:
# Now let us assign each reason from RW to annotated reason

df_papers_w_reasons = df_papers.merge(df_reasons, on='Record ID', how='left')
df_papers_w_reasons['Reason'] = df_papers_w_reasons['Reason'].str.split(";").str[:-1]
df_papers_w_reasons.sort_values(by='ReasonMajority')

Unnamed: 0,Record ID,Reason,ReasonMajority,RetractorMajority
10218,8450,[+Breach of Policy by Author],ambiguous,journal
11006,9202,[+Duplicate Publication through Error by Journ...,ambiguous,journal
10977,3351,"[+Concerns/Issues About Data, +Investigation b...",ambiguous,author
12475,18226,[+Duplication of Article],ambiguous,journal
22095,17286,"[+Notice - Limited or No Information, +Withdra...",ambiguous,author
...,...,...,...,...
26499,10433,"[+Error in Results and/or Conclusions, +Unreli...",,
26500,18529,[+Copyright Claims],,
26501,18531,[+Copyright Claims],,
26502,17246,[+Error in Analyses],,
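
The `.str.split(";").str[:-1]` step relies on every RW reason string ending with a semicolon: the split then yields a trailing empty string, which the slice removes. The sketch below shows this on toy strings, along with a hedged alternative that filters empty pieces instead of slicing. The third entry, which lacks the trailing semicolon, is a hypothetical case added only to show the difference; the alternative also assumes there are no missing values.

```python
import pandas as pd

s = pd.Series(["+Error in Data;+Error in Methods;",
               "+Plagiarism of Article;",
               "+Withdrawal"])  # hypothetical entry without trailing ';'

parts = s.str.split(";")

# Slicing drops the last piece, whatever it is
trimmed = parts.str[:-1]

# Alternative: drop only the empty pieces (assumes no NaN entries)
cleaned = parts.apply(lambda pieces: [p for p in pieces if p])

print(trimmed.tolist())
print(cleaned.tolist())
```

With slicing, an entry without a trailing semicolon would lose its last reason; filtering on emptiness avoids that at the cost of an explicit `apply`.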


In [81]:
# Now let us go through each paper that is annotated, and assign the RW reason to a list of annotated reasons

dict_rwreason_to_annotatedreason = {}

df_papers_w_reasons_noNa = df_papers_w_reasons[~df_papers_w_reasons['ReasonMajority'].isna()]

for index, row in df_papers_w_reasons_noNa.iterrows():
    lst_rw_reasons = row['Reason']
    annotated_reason = row['ReasonMajority']
    
    # Going through list of RW reasons for the current paper
    for rw_reason in lst_rw_reasons:
        
        # If it is already in the dictionary, add the annotated reason to the list of RW reason
        if(dict_rwreason_to_annotatedreason.get(rw_reason)):
            dict_rwreason_to_annotatedreason[rw_reason].append(annotated_reason)
        # Else create new list
        else:
            dict_rwreason_to_annotatedreason[rw_reason] = [annotated_reason]

'''
Now that we have mapped each reason in RW to a list of reasons annotated by humans, 
we will first augment the df_ with Reason category, and then, we will 
go through each paper in df_papers and assign the RW reasons a category. 
'''

# Let us extract papers that are relevant

df_papers_w_reasons_relevant = df_papers_w_reasons[df_papers_w_reasons['Record ID'].\
                                                    isin(df_filtered_sample['Record ID'].unique())].copy()

'''
Now we will go through each paper and check whether ReasonMajority (formerly reason_concordance) 
is annotated; if so, we return that label for both columns. If not, we take the list of RW reasons, 
look up the list of annotated reasons mapped to each one, and then take a majority vote using two methods: 
1) overall majority
2) majority of majority votes
'''

from collections import Counter

def find_most_common_reason(lst):
    """
    Given a non-empty list of reasons, return the most common one.
    If a single reason has the highest count, return it. 
    Ties are broken in the following priority order: 
    misconduct, plagiarism, mistake, ambiguous, other, unknown.
    """
    # Let us first count the number of occurrences
    counts = Counter(lst)
    
    # Now we extract the max count
    max_count = counts.most_common(1)[0][1]

    # Now we check which values had the max count -- could be more than 1
    most_commons = [value for value, count in counts.most_common() if count == max_count]

    # If it is indeed more than one, then we choose in the order of M, P, H, ambiguous, O, U

    reason = most_commons[0]
    
    if(len(most_commons) > 1):
        if("misconduct" in most_commons):
            reason = "misconduct"
        elif("plagiarism" in most_commons):
            reason = "plagiarism"
        elif("mistake" in most_commons):
            reason = "mistake"
        elif("ambiguous" in most_commons):
            reason = "ambiguous"
        elif("other" in most_commons):
            reason = "other"
        else:
            reason = "unknown"

    return reason
    
def reason_propagation(row):
    """
    This function will be used to propagate reasons for the 
    papers that were not annotated manually.
    """
    
    if not pd.isnull(row['ReasonMajority']):
        # If paper was annotated, we use the label as is.
        return pd.Series([row['ReasonMajority'], row['ReasonMajority']])
    
    else:
        # If the paper was not annotated we use the label propagation algorithm
        lst_mapped_reasons_overall = []
        lst_mapped_reasons_mOfm = []
        
        # We go through each reason mentioned in the RW dataset
        for rw_reason in row['Reason']:
            # We extract the list of annotated reasons mapped to this particular reason
            mapped_reasons = dict_rwreason_to_annotatedreason.get(rw_reason,[])
            
            # We check if the mapping existed, if not we do nothing
            if len(mapped_reasons) > 0:
                
                # If the RW reason was mapped to a list of annotated reasons,
                # we append that list to our overall list for the overall vote we take later
                lst_mapped_reasons_overall = lst_mapped_reasons_overall + mapped_reasons
                
                # For the Majority of majority voting, we shall first extract the local majority
                lst_mapped_reasons_mOfm.append(find_most_common_reason(mapped_reasons))
                
        # Now we shall find the most common reason for both the lists
        return pd.Series([find_most_common_reason(lst_mapped_reasons_overall),
                         find_most_common_reason(lst_mapped_reasons_mOfm)])

df_papers_w_reasons_relevant[['ReasonPropagatedOverallMajority',
                              'ReasonPropagatedMajorityOfMajority']] = df_papers_w_reasons_relevant.\
                                                                            apply(reason_propagation,axis=1)
df_papers_w_reasons_relevant

Unnamed: 0,Record ID,Reason,ReasonMajority,RetractorMajority,ReasonPropagatedOverallMajority,ReasonPropagatedMajorityOfMajority
5106,18798,"[+Date of Retraction/Other Unknown, +Plagiaris...",,,plagiarism,plagiarism
10212,6578,"[+Fake Peer Review, +Investigation by Journal/...",,,misconduct,misconduct
10213,6582,[+Fake Peer Review],,,misconduct,misconduct
10214,6579,[+Fake Peer Review],,,misconduct,misconduct
10215,6580,"[+Fake Peer Review, +Investigation by Journal/...",,,misconduct,misconduct
...,...,...,...,...,...,...
26381,2621,[+Notice - Unable to Access via current resour...,,,unknown,unknown
26382,1778,"[+Unreliable Data, +Unreliable Results]",,,mistake,mistake
26383,3389,"[+Concerns/Issues About Results, +Contaminatio...",,,mistake,mistake
26384,4325,[+Error in Data],,,mistake,mistake
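
The tie-breaking rule in `find_most_common_reason` can also be written as a lookup against a priority list. The sketch below is a compact restatement under that reading, not the code that produced the table above; the names `PRIORITY` and `most_common_with_priority` are hypothetical. Like the original, it assumes a non-empty input list.

```python
from collections import Counter

# Tie-break order taken from the if/elif chain in find_most_common_reason
PRIORITY = ["misconduct", "plagiarism", "mistake",
            "ambiguous", "other", "unknown"]

def most_common_with_priority(labels):
    """Return the most frequent label; break ties using PRIORITY."""
    counts = Counter(labels)
    max_count = max(counts.values())  # raises ValueError on an empty list
    tied = [label for label, n in counts.items() if n == max_count]

    # A single winner is returned as is
    if len(tied) == 1:
        return tied[0]

    # Otherwise pick the tied label that ranks highest in PRIORITY
    for label in PRIORITY:
        if label in tied:
            return label
    return "unknown"

print(most_common_with_priority(["mistake", "misconduct"]))
print(most_common_with_priority(["other", "other", "unknown"]))
```

Keeping the order in one list makes it easier to see that `ambiguous` ranks between `mistake` and `other`.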


In [82]:
df_filtered_sample.MAGPID.nunique()

6188

In [83]:
df_papers_w_reasons_relevant['Record ID'].nunique()

6188

In [84]:
df_papers_w_reasons_relevant.drop_duplicates(subset=['Record ID']).\
    ReasonPropagatedMajorityOfMajority.value_counts()

ReasonPropagatedMajorityOfMajority
plagiarism    2081
misconduct    1717
mistake       1301
unknown        655
other          384
ambiguous       50
Name: count, dtype: int64

In [95]:
df_papers_w_reasons_relevant.ReasonPropagatedOverallMajority.value_counts()

ReasonPropagatedOverallMajority
plagiarism    2075
misconduct    1446
mistake       1436
unknown        775
other          406
ambiguous       50
Name: count, dtype: int64

In [89]:
df_papers_w_reasons_relevant[~df_papers_w_reasons_relevant['RetractorMajority'].isna()]

Unnamed: 0,Record ID,Reason,ReasonMajority,RetractorMajority,ReasonPropagatedOverallMajority,ReasonPropagatedMajorityOfMajority
10218,8450,[+Breach of Policy by Author],ambiguous,journal,ambiguous,ambiguous
10221,4312,[+Withdrawal],unknown,unknown,unknown,unknown
10224,2196,"[+Notice - Limited or No Information, +Withdra...",unknown,unknown,unknown,unknown
10225,12992,"[+Error by Journal/Publisher, +Retract and Rep...",other,journal,other,other
10233,8102,"[+Notice - Limited or No Information, +Withdra...",unknown,unknown,unknown,unknown
...,...,...,...,...,...,...
26363,772,"[+Error in Data, +Error in Methods]",mistake,author,mistake,mistake
26364,173,[Notice - No/Limited Information],plagiarism,journal,plagiarism,plagiarism
26368,3031,"[+Contamination of Cell Lines/Tissues, +Error ...",mistake,author,mistake,mistake
26372,4549,"[+Concerns/Issues About Data, +Results Not Rep...",mistake,author,mistake,mistake


In [92]:
# Saving this file for later merging

FILENAME = "propagated_reasons_for_paper_matched_sample"

# Create the full output file path
file_path = os.path.join(OUTDIR, f"{FILENAME}.csv")

# Writing DataFrame to CSV with error handling
try:
    df_papers_w_reasons_relevant.to_csv(file_path, index=False)
    print("File saved successfully")
except Exception as e:
    print(f"Error saving file: {e}")


File saved successfully


In [94]:
# Just doing some sanity checks

df_papers_w_reasons_relevant[df_papers_w_reasons_relevant['ReasonMajority']==\
                             df_papers_w_reasons_relevant['ReasonPropagatedMajorityOfMajority']]

Unnamed: 0,Record ID,Reason,ReasonMajority,RetractorMajority,ReasonPropagatedOverallMajority,ReasonPropagatedMajorityOfMajority
10218,8450,[+Breach of Policy by Author],ambiguous,journal,ambiguous,ambiguous
10221,4312,[+Withdrawal],unknown,unknown,unknown,unknown
10224,2196,"[+Notice - Limited or No Information, +Withdra...",unknown,unknown,unknown,unknown
10225,12992,"[+Error by Journal/Publisher, +Retract and Rep...",other,journal,other,other
10233,8102,"[+Notice - Limited or No Information, +Withdra...",unknown,unknown,unknown,unknown
...,...,...,...,...,...,...
26363,772,"[+Error in Data, +Error in Methods]",mistake,author,mistake,mistake
26364,173,[Notice - No/Limited Information],plagiarism,journal,plagiarism,plagiarism
26368,3031,"[+Contamination of Cell Lines/Tissues, +Error ...",mistake,author,mistake,mistake
26372,4549,"[+Concerns/Issues About Data, +Results Not Rep...",mistake,author,mistake,mistake


In [97]:
# These are the papers where our two voting schemes disagree; for all other papers they agree.
df_papers_w_reasons_relevant[df_papers_w_reasons_relevant['ReasonPropagatedOverallMajority']!=\
                             df_papers_w_reasons_relevant['ReasonPropagatedMajorityOfMajority']]

Unnamed: 0,Record ID,Reason,ReasonMajority,RetractorMajority,ReasonPropagatedOverallMajority,ReasonPropagatedMajorityOfMajority
10222,4232,[+Concerns/Issues about Referencing/Attributio...,,,mistake,plagiarism
10228,11633,"[+Duplication of Article, +Duplication of Data...",,,misconduct,plagiarism
10265,8239,"[+Error by Journal/Publisher, +Lack of Approva...",,,other,misconduct
10286,1085,"[+Ethical Violations by Author, +Investigation...",,,unknown,misconduct
10299,22935,"[+Concerns/Issues About Data, +Lack of Approva...",,,mistake,misconduct
...,...,...,...,...,...,...
26282,2844,"[+Civil Proceedings, +Concerns/Issues About Da...",,,mistake,misconduct
26317,18602,"[+Investigation by Company/Institution, +Resul...",,,mistake,misconduct
26345,4664,"[+Error in Analyses, +Error in Methods, +Euphe...",,,mistake,misconduct
26359,3391,"[+Investigation by Company/Institution, +Inves...",,,mistake,misconduct
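
The two schemes can disagree when one RW reason has many more annotated examples than the others: it dominates the pooled vote but counts only once in the majority-of-majorities vote. The toy example below illustrates this; the reason names and counts in `rw_to_labels` are hypothetical and chosen only to reproduce the `mistake` versus `misconduct` pattern seen above.

```python
from collections import Counter

# Hypothetical mapping from RW reason to annotated labels (toy counts)
rw_to_labels = {
    "+Reason A": ["mistake"] * 5 + ["misconduct"],
    "+Reason B": ["misconduct"] * 2,
    "+Reason C": ["misconduct"],
}
paper_reasons = ["+Reason A", "+Reason B", "+Reason C"]

def top(labels):
    """Most frequent label (the toy data contains no ties)."""
    return Counter(labels).most_common(1)[0][0]

# Scheme 1: pool all annotated labels, then vote once
pooled = [lab for r in paper_reasons for lab in rw_to_labels[r]]
overall = top(pooled)

# Scheme 2: vote within each RW reason, then vote over the winners
local = [top(rw_to_labels[r]) for r in paper_reasons]
majority_of_majority = top(local)

print(overall, majority_of_majority)
```

Here the pooled vote is 5 `mistake` against 4 `misconduct`, while the per-reason winners are one `mistake` and two `misconduct`, so the two columns differ for this paper.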
