## Processing Paper Matches

In this notebook, we process the RW-MAG paper matches (Retraction Watch records matched to Microsoft Academic Graph papers) and consolidate the matches obtained under different criteria into a single CSV file. We also remove retraction notices from the MAG matches.

## Filtering records and matches


0. **Remove retraction notices in RW**
1. **Remove all the bulk retractions**
2. **Remove all records with duplicate titles in RW**
3. **Remove all records with a retraction year outside 1990-2015**
4. **Paper Matching**
       a. Exact DOI
       b. Exact Title
       c. Exact year fuzzy title
           c1. Keep all papers with the same DOI
           c2. Matches to hard code: MAGPID: 2418262483 for 4465 and 3011105395 for 24881
           c3. Remove all the matches that have "Retraction Note" or "Retraction Notice"
           c4. We will keep all the records with the same retraction year as publication year.
       d. Fuzzy year fuzzy matching
           d1. Keep all papers with the same DOI
           d3. Remove all the matches that have "Retraction Note" or "Retraction Notice"
           d4. We will keep all the records with the same retraction year as publication year.
5. Compile all the matches into a single dataframe and save it

In [1]:
# Importing relevant packages

import pandas as pd
import os
from config_reader import read_config

In [101]:
# Reading paths
paths = read_config()
FUZZYMATCH_LOCAL_PATH = paths['FUZZYMATCH_LOCAL_PATH']
RW_ORIGINAL_W_YEAR_LOCAL_PATH = paths['RW_ORIGINAL_W_YEAR_LOCAL_PATH']
PROCESSED_EXACT_PAPER_MATCH_LOCAL_PATH = paths['PROCESSED_EXACT_PAPER_MATCH_LOCAL_PATH']
RETRACTION_NOTICES_LOCAL_PATH = paths['RETRACTION_NOTICES_LOCAL_PATH']
OUTDIR = paths['PROCESSED_FOLDER_LOCAL']

In [3]:
# Reading list of all files in fuzzy match directory
flist = os.listdir(FUZZYMATCH_LOCAL_PATH)

In [93]:
# Reading retraction watch dataset
df_rw = pd.read_csv(RW_ORIGINAL_W_YEAR_LOCAL_PATH, usecols=['Record ID', 'Title', 'RetractionDate' , 
                                                            'RetractionYear', 'OriginalPaperYear',
                                                           'RetractionDOI','OriginalPaperDOI',
                                                            'RetractionPubMedID', 'OriginalPaperPubMedID',
                                                           'Journal', 'ArticleType', 'Reason'])
df_rw.head(1)

Unnamed: 0,Record ID,Title,Journal,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDOI,OriginalPaperPubMedID,Reason,OriginalPaperYear,RetractionYear
0,28599,TWEAK-Fn14 Influences Neurogenesis Status via ...,Molecular Neurobiology,Research Article;,2021-05-15,10.1007/s12035-021-02424-8,33991321,10.1007/s12035-016-0248-x,27822714,+Duplication of Image;+Unreliable Data;,2016.0,2021.0


# 0. Identifying Retraction Notices in RW

In [5]:
# Extracting the records whose ArticleType contains "Retraction Notice" (very few records)
records_removed_notices_from_RW = df_rw[~df_rw['ArticleType'].isna() & df_rw['ArticleType']\
                                        .str.contains('Retraction Notice')]['Record ID'].unique()

print(f"Number of records identified as retraction notices in RW: {len(records_removed_notices_from_RW)}")

Number of records identified as retraction notices in RW: 6
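
The mask above guards against missing `ArticleType` values with an explicit `isna()` check. A minimal sketch of the same idea on made-up data, using the `na=False` argument of `str.contains` instead. The toy frame and its values are illustrative assumptions, not RW records:

```python
import pandas as pd

# Toy stand-in for df_rw (values are invented for illustration)
toy_rw = pd.DataFrame({
    'Record ID': [1, 2, 3, 4],
    'ArticleType': ['Research Article;', 'Retraction Notice;', None,
                    'Review Article;Retraction Notice;'],
})

# na=False treats a missing ArticleType as "does not contain the keyword"
is_notice = toy_rw['ArticleType'].str.contains('Retraction Notice', na=False)
notice_ids = toy_rw.loc[is_notice, 'Record ID'].unique()
print(list(notice_ids))
```

Either form avoids the error that a mask containing missing values would raise when used for boolean indexing.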


# 1. Identifying bulk retractions

In [6]:
# Identify bulk retractions: count the records retracted by the same journal on the same date
df_rw_bulkCounts = df_rw.groupby(['Journal','RetractionDate'])['Record ID'].nunique().reset_index()\
                            .rename(columns={'Record ID':'bulkCounts'})

# merging with actual RW 
df_rw_temp = df_rw.merge(df_rw_bulkCounts, on=['Journal','RetractionDate'], how='left')

# Records to be removed due to bulk retractions (5 or more records per journal and retraction date)
records_removed_bulk_retractions = df_rw_temp[df_rw_temp['bulkCounts'].ge(5)]['Record ID'].unique()

print(f"Number of records identified in bulk retractions: {len(records_removed_bulk_retractions)}")

Number of records identified in bulk retractions: 11357


In [7]:
# Number of bulk-retraction records that were not already flagged as retraction notices

len(set(records_removed_bulk_retractions)-set(records_removed_notices_from_RW))

11356
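
The bulk-retraction rule (same journal, same retraction date, at least 5 records) can also be written without the intermediate merge. The following is a sketch on invented data, using `groupby(...).transform('nunique')` to attach the group size to every row. The frame and the name `BULK_THRESHOLD` are assumptions made for illustration:

```python
import pandas as pd

# Toy stand-in for df_rw: five records retracted by J1 on the same date
toy_rw = pd.DataFrame({
    'Record ID': [1, 2, 3, 4, 5, 6, 7],
    'Journal': ['J1'] * 5 + ['J2', 'J1'],
    'RetractionDate': ['2014-01-01'] * 5 + ['2014-01-01', '2015-06-01'],
})

BULK_THRESHOLD = 5  # same cutoff as in the cell above

# Number of distinct records sharing each row's (Journal, RetractionDate) pair
bulk_counts = toy_rw.groupby(['Journal', 'RetractionDate'])['Record ID'].transform('nunique')
bulk_ids = toy_rw.loc[bulk_counts.ge(BULK_THRESHOLD), 'Record ID'].unique()
print(list(bulk_ids))
```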

# 2. Identifying duplicate title records

In [8]:
# Extracting records with duplicate titles in RW
records_removed_duplicate_titles = df_rw[df_rw['Title'].duplicated(keep=False)]\
                                        .sort_values(by=['Title'])['Record ID'].unique()

print(f"Number of records identified in duplicate titles: {len(records_removed_duplicate_titles)}")

Number of records identified in duplicate titles: 123


In [9]:
# Number of duplicate-title records not already flagged by the earlier filters

len(set(records_removed_duplicate_titles)-set(records_removed_bulk_retractions)-set(records_removed_notices_from_RW))

108
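
The duplicate-title step relies on `duplicated(keep=False)`, which flags every member of a duplicated group rather than only the later repeats. A small sketch with invented titles shows the difference from the default `keep='first'`:

```python
import pandas as pd

# Invented titles for illustration
toy_rw = pd.DataFrame({
    'Record ID': [10, 11, 12, 13],
    'Title': ['Paper A', 'Paper B', 'Paper A', 'Paper C'],
})

# keep=False marks every row whose title occurs more than once
dup_all = toy_rw.loc[toy_rw['Title'].duplicated(keep=False), 'Record ID'].tolist()

# The default keep='first' leaves the first occurrence unmarked
dup_repeats = toy_rw.loc[toy_rw['Title'].duplicated(), 'Record ID'].tolist()

print(dup_all, dup_repeats)
```

With `keep=False`, both records that share a title are removed, which matches the intent of dropping ambiguous titles entirely.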

# 3. Identifying records outside 1990-2015

In [10]:
# extracting records that are not in our window
records_removed_1990_2015 = df_rw[df_rw['RetractionYear'].lt(1990) | 
                                 df_rw['RetractionYear'].gt(2015)]\
                                        ['Record ID'].unique()

print(f"Number of records identified beyond 1990-2015: {len(records_removed_1990_2015)}")

Number of records identified beyond 1990-2015: 10321


In [11]:
# looking at the difference in the number of records due to this specific filter
len(set(records_removed_1990_2015)-set(records_removed_duplicate_titles)\
        -set(records_removed_bulk_retractions)-set(records_removed_notices_from_RW))

8330

# * Creating big list of records removed so far

In [12]:
# Finally creating a big list of records that we just have to remove due to filtering process
records_filtered = set(list(records_removed_1990_2015)+list(records_removed_duplicate_titles)\
        +list(records_removed_bulk_retractions)+list(records_removed_notices_from_RW))

print(f"Number of records removed due to the above filters {len(records_filtered)}")

Number of records removed due to the above filters 19800
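
The per-filter counts printed in the cells above (6, 11356, 108 and 8330) are each filter's contribution after the earlier filters, and they add up to the 19800 filtered records. A minimal sketch of that bookkeeping follows. The helper `incremental_counts` and the record IDs are hypothetical and serve only as an illustration:

```python
def incremental_counts(named_sets):
    """For each (name, ids) pair, in order, count the ids not seen in any earlier set."""
    seen = set()
    counts = {}
    for name, ids in named_sets:
        new_ids = set(ids) - seen
        counts[name] = len(new_ids)
        seen |= new_ids
    return counts

# Invented record IDs, for illustration only
toy_filters = [
    ('notices', [1]),
    ('bulk', [1, 2, 3]),
    ('duplicate_titles', [3, 4]),
    ('outside_window', [4, 5, 6]),
]
counts = incremental_counts(toy_filters)
print(counts)
print(sum(counts.values()))  # equals the size of the union of all sets
```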


# 4. Paper Matching

In [13]:
def extract_record_summary(dfi):
    """
    Print a summary of the matches in `dfi`:
    (a) the number of unique records,
    (b) the average and maximum number of MAG matches per record, and
    (c) the number of records with exactly one match and with more than one match.
    The same statistics are then repeated for records retracted between 1990 and 2015.
    """
    
    print("Number of unique records:", dfi['Record ID'].nunique())
    print("Average number of matches per record", 
          dfi.groupby(['Record ID'])['MAGPID'].nunique().mean())
    print("Max number of matches per record", 
          dfi.groupby(['Record ID'])['MAGPID'].nunique().max())
    dfj = dfi.groupby(['Record ID'])['MAGPID'].nunique().reset_index()\
                .rename(columns={'MAGPID':'NumMatches'})
    print("Records with exactly 1 match", 
          dfj[dfj['NumMatches'].eq(1)]['Record ID'].nunique())
    print("Records with more than 1 match",
         dfj[dfj['NumMatches'].gt(1)]['Record ID'].nunique())
    
    
    print("###########")
    records_1990_2015 = dfi[dfi['RetractionYear'].ge(1990) & dfi['RetractionYear'].le(2015)]
    print("Number of unique records retracted between 1990-2015:", 
          records_1990_2015['Record ID'].nunique())
    print("Average number of matches per record for 1990-2015", 
          records_1990_2015.groupby(['Record ID'])['MAGPID'].nunique().mean())
    print("Max number of matches per record for 1990-2015", 
          records_1990_2015.groupby(['Record ID'])['MAGPID'].nunique().max())
    dfk = records_1990_2015.groupby(['Record ID'])['MAGPID'].nunique().reset_index()\
                .rename(columns={'MAGPID':'NumMatches'})
    print("Records with exactly 1 match for 1990-2015", 
          dfk[dfk['NumMatches'].eq(1)]['Record ID'].nunique())
    print("Records with more than 1 match for 1990-2015",
         dfk[dfk['NumMatches'].gt(1)]['Record ID'].nunique())
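
The summary above prints its statistics. When the numbers need to be compared across steps, a variant that returns them can be easier to work with. This is a sketch under that assumption; the function name `match_summary` and the toy data are not part of the original notebook:

```python
import pandas as pd

def match_summary(dfi):
    """Return the per-record match statistics as a dict instead of printing them."""
    n_matches = dfi.groupby('Record ID')['MAGPID'].nunique()
    return {
        'records': int(n_matches.size),
        'mean_matches': float(n_matches.mean()),
        'max_matches': int(n_matches.max()),
        'single_match': int(n_matches.eq(1).sum()),
        'multi_match': int(n_matches.gt(1).sum()),
    }

# Invented matches: record 3 has a repeated MAGPID that counts only once
toy = pd.DataFrame({'Record ID': [1, 1, 2, 3, 3, 3],
                    'MAGPID': [100, 101, 200, 300, 301, 301]})
summary = match_summary(toy)
print(summary)
```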
    

## a. and b. Processing Exact Matching (based on DOI or Title)

In [14]:
# Let us explore exact matching first

# Reading papers that were matched by exact matching
df = pd.read_csv(PROCESSED_EXACT_PAPER_MATCH_LOCAL_PATH)
df = df.merge(df_rw.drop(columns=['OriginalPaperDOI']), on='Record ID')
df.head(1)

Unnamed: 0,Record ID,RWTitleNorm,OriginalPaperDOI,MAGPID,MAGTitle,MAGPubYear,Title,Journal,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperPubMedID,Reason,RetractionYear
0,28505,tet1 exerts its tumor suppressor function by r...,10.1042/BSR20160523,2597493214,tet1 exerts its tumour suppressor function by ...,2017.0,TET1 exerts its tumor suppressor function by r...,Bioscience Reports,Research Article;,2021-05-14,10.1042/BSR-20160523_RET,33988682,28341638,+Concerns/Issues About Data;+Duplication of Im...,2021.0


In [15]:
# Extracting those that were matched only on DOI i.e. their DOI != NaN
matched_doi = df[~df['OriginalPaperDOI'].isna()]

# Let us remove all the records that were filtered
matched_doi = matched_doi[~matched_doi['Record ID'].isin(records_filtered)]

extract_record_summary(matched_doi)

Number of unique records: 1625
Average number of matches per record 1.0104615384615385
Max number of matches per record 4
Records with exactly 1 match 1613
Records with more than 1 match 12
###########
Number of unique records retracted between 1990-2015: 1625
Average number of matches per record for 1990-2015 1.0104615384615385
Max number of matches per record for 1990-2015 4
Records with exactly 1 match for 1990-2015 1613
Records with more than 1 match for 1990-2015 12


In [17]:
# Extracting records that were matched on title (because they have no DOI)
matched_title = df[df.OriginalPaperDOI.isna()]

# Let us remove all the records that were filtered
matched_title = matched_title[~matched_title['Record ID'].isin(records_filtered)]

#matched_title = matched_title[matched_title['Record ID'].isin(matched_doi['Record ID'].unique())]

extract_record_summary(matched_title)

Number of unique records: 1021
Average number of matches per record 1.1547502448579823
Max number of matches per record 20
Records with exactly 1 match 926
Records with more than 1 match 95
###########
Number of unique records retracted between 1990-2015: 1021
Average number of matches per record for 1990-2015 1.1547502448579823
Max number of matches per record for 1990-2015 20
Records with exactly 1 match for 1990-2015 926
Records with more than 1 match for 1990-2015 95


# c. and d. Processing Fuzzy Matching

In [19]:
PROCESSED_EXACT_PAPER_MATCH_LOCAL_PATH

'/Users/sm9654/desktop/NYUAD/nyuad-research/retraction_openalex/retraction_effects_on_academic_careers/data/processed/fuzzymatch/RW_MAG_exact_paper_matched.csv'

In [20]:
# Initializing the two lists for two ways we did fuzzy matching
dfs_exactyear = []
dfs_fuzzyyear = []

exact_match_fname = os.path.basename(PROCESSED_EXACT_PAPER_MATCH_LOCAL_PATH)

# going through the file list
for fname in flist:
    # Only reading if it is not exact match
    if fname != exact_match_fname:
        # os.path.join works whether or not the directory path ends with a separator
        df = pd.read_csv(os.path.join(FUZZYMATCH_LOCAL_PATH, fname))
        # If it is exact year fuzzy matching
        if "exact_year" in fname:
            dfs_exactyear.append(df)
        # If it is fuzzy year fuzzy matching
        else:
            dfs_fuzzyyear.append(df)
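
The loop above decides how to treat each file from its name alone. A sketch of that classification as a standalone function, which also skips non-CSV entries that `os.listdir` can return. The function name and the file names in the example are hypothetical:

```python
def split_fuzzy_files(fnames, exact_match_fname):
    """Split file names into (exact-year, fuzzy-year) lists.

    The exact-match file and any non-CSV entries are skipped.
    """
    exact_year, fuzzy_year = [], []
    for fname in sorted(fnames):
        if fname == exact_match_fname or not fname.endswith('.csv'):
            continue
        if 'exact_year' in fname:
            exact_year.append(fname)
        else:
            fuzzy_year.append(fname)
    return exact_year, fuzzy_year

# Hypothetical directory listing
toy_flist = ['.DS_Store', 'RW_MAG_exact_paper_matched.csv',
             'fuzzy_exact_year_0.csv', 'fuzzy_year_0.csv']
exact_files, fuzzy_files = split_fuzzy_files(toy_flist, 'RW_MAG_exact_paper_matched.csv')
print(exact_files, fuzzy_files)
```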

## c. Processing exact year fuzzy matching

In [21]:
# Processing exact year fuzzy matching
df_exactyear = pd.concat(dfs_exactyear)
df_exactyear = df_exactyear.merge(df_rw, on='Record ID')

# Removing records that were filtered
df_exactyear = df_exactyear[~df_exactyear['Record ID'].isin(records_filtered)]

df_exactyear.head(2)

Unnamed: 0,MAGTitle,score,index,MAGPID,MAGTitle.1,MAGPubYear,Record ID,RWTitleNorm,Title,Journal,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDOI,OriginalPaperPubMedID,Reason,RetractionYear
76,retracted use of upper triangular matrix track...,95.0,7148009,2028375193,retracted use of upper triangular matrix track...,2013.0,18798,use of upper triangular matrix tracking for co...,Use of upper triangular matrix tracking for co...,Signal Processing,Research Article;,2013-09-01,10.1016/j.sigpro.2013.01.002,0,10.1016/j.sigpro.2013.01.002,0,+Date of Retraction/Other Unknown;+Plagiarism ...,2013.0
198,retracted gamma glutamyl transferase activity ...,95.0,7380434,2003800286,retracted gamma glutamyl transferase activity ...,2013.0,9165,gamma-glutamyl transferase activity in kids bo...,Gamma-Glutamyl Transferase Activity in Kids Bo...,Food and Nutrition Sciences,Research Article;,2015-12-15,10.4236/fns.2013.46A006,0,10.4236/fns.2013.46A006,0,+Falsification/Fabrication of Data;+Falsificat...,2015.0


In [22]:
# summarizing exact year fuzzy matching
extract_record_summary(df_exactyear)

Number of unique records: 3137
Average number of matches per record 1.1976410583359898
Max number of matches per record 3
Records with exactly 1 match 2596
Records with more than 1 match 541
###########
Number of unique records retracted between 1990-2015: 3137
Average number of matches per record for 1990-2015 1.1976410583359898
Max number of matches per record for 1990-2015 3
Records with exactly 1 match for 1990-2015 2596
Records with more than 1 match for 1990-2015 541


In [24]:
# Let us now define a list of papers to keep
# These will be those that are 
# a) in df_exactyear and have retraction doi same as original paper doi
# b) in df_exactyear and do not have "Retraction Note" or "Retraction Notice" in their title
# c) in df_exactyear and have publication year same as retraction year

# extracting records that have same paper doi as retraction doi
records_same_doi = df_rw[df_rw['OriginalPaperDOI'].eq(df_rw['RetractionDOI']) & 
                         ~df_rw['RetractionDOI'].isin(['unavailable','Unavailable'])]['Record ID'].unique()

# extracting the dataframe/matches for those
df_exactyear2 = df_exactyear[df_exactyear['Record ID'].isin(records_same_doi)]

extract_record_summary(df_exactyear2)


Number of unique records: 472
Average number of matches per record 1.1398305084745763
Max number of matches per record 3
Records with exactly 1 match 412
Records with more than 1 match 60
###########
Number of unique records retracted between 1990-2015: 472
Average number of matches per record for 1990-2015 1.1398305084745763
Max number of matches per record for 1990-2015 3
Records with exactly 1 match for 1990-2015 412
Records with more than 1 match for 1990-2015 60
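
The same-DOI criterion compares two columns row by row and excludes the placeholder value used when no DOI is available. A sketch on invented DOIs; lowercasing before the comparison is an assumption that covers both spellings of the placeholder listed in the cell above:

```python
import pandas as pd

# Invented DOIs for illustration
toy_rw = pd.DataFrame({
    'Record ID': [1, 2, 3],
    'OriginalPaperDOI': ['10.1/a', '10.1/b', 'Unavailable'],
    'RetractionDOI': ['10.1/a', '10.1/b-ret', 'Unavailable'],
})

same = toy_rw['OriginalPaperDOI'].eq(toy_rw['RetractionDOI'])
# Two placeholders are equal as strings but do not identify the same paper
placeholder = toy_rw['RetractionDOI'].str.lower().eq('unavailable')
same_doi_ids = toy_rw.loc[same & ~placeholder, 'Record ID'].tolist()
print(same_doi_ids)
```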


In [25]:
# Adding records that were not in retraction notices


# Now let us read retraction notices and remove them from matches
df_retraction_notices = pd.read_csv(RETRACTION_NOTICES_LOCAL_PATH)

# extracting records that were fuzzy matched and have at least one match that is not a retraction notice
records_not_in_notices = df_exactyear[~df_exactyear['MAGPID'].\
                                        isin(df_retraction_notices['PID'])]['Record ID'].unique()

# extracting records that either have same doi or are not in retraction notices
df_exactyear22 = df_exactyear[df_exactyear['Record ID'].isin(records_same_doi) | 
                             df_exactyear['Record ID'].isin(records_not_in_notices)]

extract_record_summary(df_exactyear22)

Number of unique records: 3029
Average number of matches per record 1.2040277319247277
Max number of matches per record 3
Records with exactly 1 match 2490
Records with more than 1 match 539
###########
Number of unique records retracted between 1990-2015: 3029
Average number of matches per record for 1990-2015 1.2040277319247277
Max number of matches per record for 1990-2015 3
Records with exactly 1 match for 1990-2015 2490
Records with more than 1 match for 1990-2015 539


In [26]:
# Adding records with same pubyear as retraction year
records_same_pubyear_ryear = df_exactyear[df_exactyear['MAGPubYear'].eq(df_exactyear['RetractionYear'])]\
                                ['Record ID'].unique()

# Extracting all records matched fuzzily, but have same doi OR same year OR not in retraction notice
# All these records are valid
df_exactyear3 = df_exactyear[df_exactyear['Record ID'].isin(records_same_doi) | 
                            df_exactyear['Record ID'].isin(records_same_pubyear_ryear) | 
                            df_exactyear['Record ID'].isin(records_not_in_notices)]

extract_record_summary(df_exactyear3)

Number of unique records: 3097
Average number of matches per record 1.2001937358734258
Max number of matches per record 3
Records with exactly 1 match 2556
Records with more than 1 match 541
###########
Number of unique records retracted between 1990-2015: 3097
Average number of matches per record for 1990-2015 1.2001937358734258
Max number of matches per record for 1990-2015 3
Records with exactly 1 match for 1990-2015 2556
Records with more than 1 match for 1990-2015 541


In [29]:
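# removing those with retraction notice keywords
# NOTE: str.contains is case-sensitive by default, and the MAGTitle values displayed
# above are lowercase, so these capitalized patterns may match nothing here
# (the summary below is identical to the previous one); pass case=False to
# match regardless of case.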

df_exactyear4 = df_exactyear3[~df_exactyear3['MAGTitle'].str.contains('Retraction Notice') & 
                             ~df_exactyear3['MAGTitle'].str.contains('Retraction Note')]

extract_record_summary(df_exactyear4)

Number of unique records: 3097
Average number of matches per record 1.2001937358734258
Max number of matches per record 3
Records with exactly 1 match 2556
Records with more than 1 match 541
###########
Number of unique records retracted between 1990-2015: 3097
Average number of matches per record for 1990-2015 1.2001937358734258
Max number of matches per record for 1990-2015 3
Records with exactly 1 match for 1990-2015 2556
Records with more than 1 match for 1990-2015 541


In [34]:
# Checking those that were removed

df_exactyear_remaining = df_exactyear[~df_exactyear['Record ID'].isin(df_exactyear4['Record ID'])]

print(f"# Records for which we are not sure {df_exactyear_remaining['Record ID'].nunique()}")

# Records for which we are not sure 40
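
The keyword filter on `MAGTitle` uses `str.contains` with its default case-sensitive matching, while the MAG titles displayed in this notebook are lowercase. The sketch below, on invented titles, contrasts the two behaviours. The single regular expression is an assumption that covers both "Retraction Note" and "Retraction Notice":

```python
import pandas as pd

# Invented titles for illustration
toy_titles = pd.Series([
    'retraction note a study on something',
    'retracted article an unrelated paper',
    'Retraction Notice to an example paper',
])

# Default behaviour: only the capitalized title matches
case_sensitive = (toy_titles.str.contains('Retraction Notice')
                  | toy_titles.str.contains('Retraction Note'))

# Case-insensitive regex covering "retraction note" and "retraction notice"
case_insensitive = toy_titles.str.contains(r'retraction not(?:e|ice)', case=False, regex=True)

print(case_sensitive.tolist(), case_insensitive.tolist())
```

Neither pattern matches "retracted article", so titles that only mark the original paper as retracted are kept in both cases.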


## d. Processing fuzzy year fuzzy matching

In [35]:
# processing fuzzy year fuzzy matching
df_fuzzyyear = pd.concat(dfs_fuzzyyear)
df_fuzzyyear = df_fuzzyyear.merge(df_rw, on='Record ID')

# removing those records that were already matched by exact year fuzzy matching
df_fuzzyyear = df_fuzzyyear[~df_fuzzyyear['Record ID'].isin(df_exactyear['Record ID'])]

# removing those records that were filtered
df_fuzzyyear = df_fuzzyyear[~df_fuzzyyear['Record ID'].isin(records_filtered)]

df_fuzzyyear.head(2)

Unnamed: 0,RWTitleNorm,MAGTitle,score,index,MAGPID,MAGTitle.1,MAGPubYear,Record ID,Title,Journal,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDOI,OriginalPaperPubMedID,Reason,RetractionYear
198,protective effects of bazedoxifene paired with...,retraction protective effects of bazedoxifene ...,92.636816,7060760,2179077735,retraction protective effects of bazedoxifene ...,2016.0,17304,Protective effects of bazedoxifene paired with...,Biological & Pharmaceutical Bulletin,Research Article;,2015-11-06,10.1248/bpb.b15-00585,26548420,10.1248/bpb.b15-00585,26548420,+Concerns/Issues About Authorship;+Conflict of...,2015.0
208,overexpression of thaumatin gene confers enhan...,retracted article overexpression of thaumatin ...,93.215339,4739624,1925258414,retracted article overexpression of thaumatin ...,2016.0,8766,Overexpression of thaumatin gene confers enhan...,"Plant Cell, Tissue and Organ Culture (PCTOC)",Research Article;,2015-08-20,10.1007/s11240-015-0846-8,0,10.1007/s11240-015-0846-8,0,+Lack of Approval from Author;,2015.0


In [36]:
# summarizing fuzzy year fuzzy matching
extract_record_summary(df_fuzzyyear)

Number of unique records: 542
Average number of matches per record 1.2250922509225093
Max number of matches per record 3
Records with exactly 1 match 433
Records with more than 1 match 109
###########
Number of unique records retracted between 1990-2015: 542
Average number of matches per record for 1990-2015 1.2250922509225093
Max number of matches per record for 1990-2015 3
Records with exactly 1 match for 1990-2015 433
Records with more than 1 match for 1990-2015 109


In [38]:
# Let us first keep those that have the same doi
# We already have a variable called 'records_same_doi'

# extracting the dataframe/matches for those
df_fuzzyyear2 = df_fuzzyyear[df_fuzzyyear['Record ID'].isin(records_same_doi)]

extract_record_summary(df_fuzzyyear2)

Number of unique records: 149
Average number of matches per record 1.1208053691275168
Max number of matches per record 3
Records with exactly 1 match 132
Records with more than 1 match 17
###########
Number of unique records retracted between 1990-2015: 149
Average number of matches per record for 1990-2015 1.1208053691275168
Max number of matches per record for 1990-2015 3
Records with exactly 1 match for 1990-2015 132
Records with more than 1 match for 1990-2015 17


In [39]:
#Extracting records not in retraction notices

# extracting records that were fuzzy matched and have at least one match that is not a retraction notice
records_not_in_notices = df_fuzzyyear[~df_fuzzyyear['MAGPID'].\
                                        isin(df_retraction_notices['PID'])]['Record ID'].unique()

# extracting records that either have same doi or are not in retraction notices
df_fuzzyyear22 = df_fuzzyyear[df_fuzzyyear['Record ID'].isin(records_same_doi) | 
                             df_fuzzyyear['Record ID'].isin(records_not_in_notices)]

extract_record_summary(df_fuzzyyear22)

Number of unique records: 522
Average number of matches per record 1.2318007662835249
Max number of matches per record 3
Records with exactly 1 match 414
Records with more than 1 match 108
###########
Number of unique records retracted between 1990-2015: 522
Average number of matches per record for 1990-2015 1.2318007662835249
Max number of matches per record for 1990-2015 3
Records with exactly 1 match for 1990-2015 414
Records with more than 1 match for 1990-2015 108


In [40]:
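# removing those with retraction notice keywords
# NOTE: str.contains is case-sensitive by default, and the MAGTitle values displayed
# above are lowercase, so these capitalized patterns may match nothing here
# (the summary below is identical to the previous one); pass case=False to
# match regardless of case.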

df_fuzzyyear3 = df_fuzzyyear22[~df_fuzzyyear22['MAGTitle'].str.contains('Retraction Notice') & 
                             ~df_fuzzyyear22['MAGTitle'].str.contains('Retraction Note')]

extract_record_summary(df_fuzzyyear3)

Number of unique records: 522
Average number of matches per record 1.2318007662835249
Max number of matches per record 3
Records with exactly 1 match 414
Records with more than 1 match 108
###########
Number of unique records retracted between 1990-2015: 522
Average number of matches per record for 1990-2015 1.2318007662835249
Max number of matches per record for 1990-2015 3
Records with exactly 1 match for 1990-2015 414
Records with more than 1 match for 1990-2015 108


# Compilation

Let us now compile a dataframe of all the possible matches between RW and MAG.

We shall also remove all the matches that were identified as retraction notices or whose titles contain the words 'Retraction Notice' or 'Retraction Note'.

The columns we are interested in are:

1) Record ID

2) MAGPID

3) RW Title

4) MAG Title

5) Retraction Year

6) Fuzzy match score if applicable

7) Method of matching (doi, title, exactYearFuzzyTitle, fuzzyYearfuzzyTitle)

8) MAGPubYear

9) RWPubYear (the original paper year from RW)

In [87]:
# processing doi matched dataframe
imp_cols = ['Record ID', 'RWTitleNorm', 'MAGPID', 'MAGTitle', 'RetractionYear','MAGPubYear']
df_matched_doi = matched_doi[imp_cols].drop_duplicates()\
                                    .reset_index(drop=True)
df_matched_doi['RecordMatchingMethod'] = 'doi'

# processing title matched dataframe
df_matched_title = matched_title[imp_cols].drop_duplicates()\
                                    .reset_index(drop=True)
df_matched_title['RecordMatchingMethod'] = 'title'


######## Start of block ########

# processing exact year fuzzy title matched dataframe
df_matched_exactyear_fuzzytitle = df_exactyear4[imp_cols+['score']].drop_duplicates()\
                                    .rename(columns={'score':'FuzzyScore'})

df_matched_exactyear_fuzzytitle['RecordMatchingMethod'] = 'exactYearfuzzyTitle'

"""
What this block does:

So far we have selected the records to keep: those whose retraction DOI
equals the original paper DOI, OR whose MAG publication year equals the
retraction year, OR that have at least one match that is not a retraction notice.

That selection works at the record level, so a kept record with more than
one match can still carry matches that are retraction notices. For records
that meet neither the same-DOI nor the same-year criterion, we therefore
drop the individual matches that are retraction notices.
"""

# Let me separate this dataframe into 2: records with the same original paper doi as retraction doi
# or the same publication year as retraction year, and all others
df_matched_exactyear_fuzzytitle_samedoiOryear = df_matched_exactyear_fuzzytitle[df_matched_exactyear_fuzzytitle['Record ID'].\
                                                                            isin(records_same_doi) | 
                                                                               df_matched_exactyear_fuzzytitle['Record ID'].\
                                                                            isin(records_same_pubyear_ryear)]

df_matched_exactyear_fuzzytitle_diffdoiAndYear = df_matched_exactyear_fuzzytitle[~df_matched_exactyear_fuzzytitle['Record ID'].\
                                                                            isin(df_matched_exactyear_fuzzytitle_samedoiOryear['Record ID'])]

# Matches in the first group are not filtered here (only manually); from the second group we remove retraction notices.
# Let us remove matches that were identified as retraction notices 3717
df_matched_exactyear_fuzzytitle_diffdoiAndYear = df_matched_exactyear_fuzzytitle_diffdoiAndYear[~df_matched_exactyear_fuzzytitle_diffdoiAndYear['MAGPID'].isin(df_retraction_notices['PID']) & 
                                               ~df_matched_exactyear_fuzzytitle_diffdoiAndYear['MAGTitle'].str.contains('Retraction Notice') & 
                                               ~df_matched_exactyear_fuzzytitle_diffdoiAndYear['MAGTitle'].str.contains('Retraction Note')]


# Now let us merge the two dataframes back
df_matched_exactyear_fuzzytitle = pd.concat([df_matched_exactyear_fuzzytitle_samedoiOryear,
                                            df_matched_exactyear_fuzzytitle_diffdoiAndYear])\
                                    .reset_index(drop=True)

######## End of Block ########


# Finally processing the fuzzy year fuzzy match
df_matched_fuzzyyear_fuzzytitle_samedoi = df_fuzzyyear2.copy()

df_matched_fuzzyyear_fuzzytitle_diffdoi = df_fuzzyyear3[~df_fuzzyyear3['Record ID'].isin(df_matched_fuzzyyear_fuzzytitle_samedoi['Record ID'])]


df_matched_fuzzyyear_fuzzytitle_diffdoi = df_matched_fuzzyyear_fuzzytitle_diffdoi[~df_matched_fuzzyyear_fuzzytitle_diffdoi['MAGPID'].isin(df_retraction_notices['PID']) &
                                                                                 ~df_matched_fuzzyyear_fuzzytitle_diffdoi['MAGTitle'].str.contains('Retraction Notice') & 
                                                                                 ~df_matched_fuzzyyear_fuzzytitle_diffdoi['MAGTitle'].str.contains('Retraction Note')]

df_matched_fuzzyyear_fuzzytitle = pd.concat([df_matched_fuzzyyear_fuzzytitle_samedoi,
                                            df_matched_fuzzyyear_fuzzytitle_diffdoi])\
                                    .reset_index(drop=True)

df_matched_fuzzyyear_fuzzytitle = df_matched_fuzzyyear_fuzzytitle[imp_cols+['score']].drop_duplicates()\
                                    .rename(columns={'score':'FuzzyScore'})

df_matched_fuzzyyear_fuzzytitle['RecordMatchingMethod'] = 'fuzzyYearfuzzyTitle'


# Now we merge the four dataframes

df_paper_matching = pd.concat([df_matched_doi,
                              df_matched_title,
                              df_matched_exactyear_fuzzytitle,
                              df_matched_fuzzyyear_fuzzytitle])

extract_record_summary(df_paper_matching)

Number of unique records: 6265
Average number of matches per record 1.1297685554668795
Max number of matches per record 20
Records with exactly 1 match 5595
Records with more than 1 match 670
###########
Number of unique records retracted between 1990-2015: 6265
Average number of matches per record for 1990-2015 1.1297685554668795
Max number of matches per record for 1990-2015 20
Records with exactly 1 match for 1990-2015 5595
Records with more than 1 match for 1990-2015 670


# Removing anomalies: If Retraction year is less than MAG publication year

In [88]:
df_paper_matching_filtered = df_paper_matching[~df_paper_matching['RetractionYear']\
                                                   .lt(df_paper_matching['MAGPubYear'])]

extract_record_summary(df_paper_matching_filtered)

Number of unique records: 6199
Average number of matches per record 1.1269559606388126
Max number of matches per record 18
Records with exactly 1 match 5541
Records with more than 1 match 658
###########
Number of unique records retracted between 1990-2015: 6199
Average number of matches per record for 1990-2015 1.1269559606388126
Max number of matches per record for 1990-2015 18
Records with exactly 1 match for 1990-2015 5541
Records with more than 1 match for 1990-2015 658
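
The anomaly filter is written as "not less than" rather than "greater than or equal". The two differ when a year is missing: comparisons with NaN evaluate to False, so `~lt(...)` keeps rows with a missing `MAGPubYear` while `ge(...)` would drop them. A sketch on invented years:

```python
import numpy as np
import pandas as pd

# Invented years: record 2 is an anomaly, record 3 has no MAG publication year
toy = pd.DataFrame({
    'Record ID': [1, 2, 3],
    'RetractionYear': [2010.0, 2010.0, 2010.0],
    'MAGPubYear': [2008.0, 2012.0, np.nan],
})

# Form used in the cell above: drop only rows known to be anomalous
kept_not_lt = toy.loc[~toy['RetractionYear'].lt(toy['MAGPubYear']), 'Record ID'].tolist()

# Stricter alternative: also drops rows where the comparison is undefined
kept_ge = toy.loc[toy['RetractionYear'].ge(toy['MAGPubYear']), 'Record ID'].tolist()

print(kept_not_lt, kept_ge)
```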


In [90]:
df_paper_matching_filtered.head(2)

Unnamed: 0,Record ID,RWTitleNorm,MAGPID,MAGTitle,RetractionYear,MAGPubYear,RecordMatchingMethod,FuzzyScore
0,6582,a study on china’s petroleum enterprise soci...,2025784643,retraction note a study on china s petroleum e...,2015.0,2015.0,doi,
1,6579,behavior analysis on the security object of th...,2219765734,behavior analysis on the security object of th...,2015.0,2015.0,doi,


# Adding an extra column on original paper year from RW

In [98]:
df_paper_matching_extended = df_paper_matching_filtered.merge(df_rw[['Record ID', 'OriginalPaperYear']], 
                                                              on='Record ID')\
                                                        .rename(columns={'OriginalPaperYear': 'RWPubYear'})
df_paper_matching_extended.head()

Unnamed: 0,Record ID,RWTitleNorm,MAGPID,MAGTitle,RetractionYear,MAGPubYear,RecordMatchingMethod,FuzzyScore,RWPubYear
0,6582,a study on china’s petroleum enterprise soci...,2025784643,retraction note a study on china s petroleum e...,2015.0,2015.0,doi,,2015.0
1,6579,behavior analysis on the security object of th...,2219765734,behavior analysis on the security object of th...,2015.0,2015.0,doi,,2015.0
2,6580,communication arising from relationship orient...,2290303548,communication arising from relationship orient...,2015.0,2015.0,doi,,2015.0
3,6581,prediction model of karstic large spring water...,2230565940,study on prediction model of karstic large spr...,2015.0,2015.0,doi,,2015.0
4,8450,unilaterally blocking the muscarinic receptors...,1913783002,unilaterally blocking the muscarinic receptors...,2015.0,2015.0,doi,,2015.0


# Saving

In [104]:
# Let us first organize columns

col_order = ['Record ID', 'MAGPID', 'RWTitleNorm', 'MAGTitle', 'RecordMatchingMethod',
            'FuzzyScore', 'RWPubYear', 'MAGPubYear' , 'RetractionYear']

import os
from datetime import datetime

# Constants
OUTPUT_DIRECTORY = OUTDIR
FILENAME = "paper_matching"

# Create a full file path with timestamp
timestamp = datetime.now().strftime("%Y%m%d")
file_path = os.path.join(OUTPUT_DIRECTORY, f"{FILENAME}_{timestamp}.csv")

# Writing DataFrame to CSV with error handling
try:
    df_paper_matching_extended[col_order].to_csv(file_path, index=False)
    print(f"File saved successfully at {file_path}")
except Exception as e:
    print(f"Error saving file: {e}")
    


File saved successfully at /Users/sm9654/desktop/NYUAD/nyuad-research/retraction_openalex/retraction_effects_on_academic_careers/data/processed/paper_matching_20240129.csv


In [118]:
# Let us also save the dataframe with duplicates i.e. more than 1 match (to be decided manually)

df_duplicates = df_paper_matching_extended[df_paper_matching_extended['Record ID'].duplicated(keep=False)]\
                .sort_values(by=['Record ID','FuzzyScore'])

FILENAME = "paper_matching_multipleMatches"

# Create a full file path with timestamp
timestamp = datetime.now().strftime("%Y%m%d")
file_path = os.path.join(OUTPUT_DIRECTORY, f"{FILENAME}_{timestamp}.csv")

# Writing DataFrame to CSV with error handling
try:
    df_duplicates[col_order].to_csv(file_path, index=False)
    print(f"File saved successfully at {file_path}")
except Exception as e:
    print(f"Error saving file: {e}")

File saved successfully at /Users/sm9654/desktop/NYUAD/nyuad-research/retraction_openalex/retraction_effects_on_academic_careers/data/processed/paper_matching_multipleMatches_20240129.csv
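
Both save cells build the output path the same way: a file stem, the current date as `YYYYMMDD`, and a `.csv` extension. A sketch of that logic as a small helper; the function name `dated_csv_path` and its `when` parameter are assumptions added so the date can be fixed for testing:

```python
import os
from datetime import datetime

def dated_csv_path(out_dir, stem, when=None):
    """Build '<out_dir>/<stem>_<YYYYMMDD>.csv', defaulting to today's date."""
    when = when or datetime.now()
    return os.path.join(out_dir, f"{stem}_{when.strftime('%Y%m%d')}.csv")

# Hypothetical output directory; the fixed date reproduces the suffix seen above
example_path = dated_csv_path('processed', 'paper_matching', datetime(2024, 1, 29))
print(example_path)
```

Because the stamp has day resolution, running the notebook twice on the same day overwrites the earlier file.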
