## Processing Paper Matches

In this notebook, we process the paper matches between Retraction Watch (RW) and Microsoft Academic Graph (MAG), and consolidate the matches obtained under different criteria into a single CSV file. We also remove the retraction notices from the MAG matches.

In [71]:
# Importing relevant packages

import pandas as pd
import os
from config_reader import read_config

In [97]:
# Reading paths
paths = read_config()
FUZZYMATCH_LOCAL_PATH = paths['FUZZYMATCH_LOCAL_PATH']
RW_ORIGINAL_W_YEAR_LOCAL_PATH = paths['RW_ORIGINAL_W_YEAR_LOCAL_PATH']
PROCESSED_EXACT_PAPER_MATCH_LOCAL_PATH = paths['PROCESSED_EXACT_PAPER_MATCH_LOCAL_PATH']
RETRACTION_NOTICES_LOCAL_PATH = paths['RETRACTION_NOTICES_LOCAL_PATH']

In [75]:
# Reading list of all files in fuzzy match directory
flist = os.listdir(FUZZYMATCH_LOCAL_PATH)

In [76]:
# Reading retraction watch dataset
df_rw = pd.read_csv(RW_ORIGINAL_W_YEAR_LOCAL_PATH, usecols=['Record ID','RetractionYear'])
df_rw.head()

Unnamed: 0,Record ID,RetractionYear
0,28599,2021.0
1,28504,2021.0
2,28506,2021.0
3,28505,2021.0
4,28502,2021.0


In [103]:
def extract_record_summary(dfi):
    """
    Print a summary of the matches in the dataframe (dfi):
    (a) the number of unique records, overall and for records
        retracted between 1990 and 2015 (inclusive),
    (b) the average and maximum number of distinct MAG matches
        per record, for both groups, and
    (c) the number of records with exactly one match and with
        more than one match, for both groups.
    """
    
    print("Number of unique records:", dfi['Record ID'].nunique())
    print("Average number of matches per record", 
          dfi.groupby(['Record ID'])['MAGPID'].nunique().mean())
    print("Max number of matches per record", 
          dfi.groupby(['Record ID'])['MAGPID'].nunique().max())
    dfj = dfi.groupby(['Record ID'])['MAGPID'].nunique().reset_index()\
                .rename(columns={'MAGPID':'NumMatches'})
    print("Records with exactly 1 match", 
          dfj[dfj['NumMatches'].eq(1)]['Record ID'].nunique())
    print("Records with more than 1 match",
         dfj[dfj['NumMatches'].gt(1)]['Record ID'].nunique())
    
    
    print("###########")
    records_1990_2015 = dfi[dfi['RetractionYear'].ge(1990) & dfi['RetractionYear'].le(2015)]
    print("Number of unique records retracted between 1990-2015:", 
          records_1990_2015['Record ID'].nunique())
    print("Average number of matches per record for 1990-2015", 
          records_1990_2015.groupby(['Record ID'])['MAGPID'].nunique().mean())
    print("Max number of matches per record for 1990-2015", 
          records_1990_2015.groupby(['Record ID'])['MAGPID'].nunique().max())
    dfk = records_1990_2015.groupby(['Record ID'])['MAGPID'].nunique().reset_index()\
                .rename(columns={'MAGPID':'NumMatches'})
    print("Records with exactly 1 match for 1990-2015", 
          dfk[dfk['NumMatches'].eq(1)]['Record ID'].nunique())
    print("Records with more than 1 match for 1990-2015",
         dfk[dfk['NumMatches'].gt(1)]['Record ID'].nunique())
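
The core of the summary is a per-record count of distinct `MAGPID` values, computed once for all records and once for the 1990-2015 window. A minimal sketch on made-up data (the IDs and years below are hypothetical, and `match_counts` is an illustrative helper, not part of the notebook):

```python
import pandas as pd

# Toy matches: record 1 has two distinct MAG candidates,
# record 3 lists the same MAGPID twice (counted once by nunique).
toy = pd.DataFrame({
    "Record ID": [1, 1, 2, 3, 3],
    "MAGPID": [10, 11, 20, 30, 30],
    "RetractionYear": [2005.0, 2005.0, 2018.0, 1995.0, 1995.0],
})

def match_counts(dfi):
    """Number of distinct MAGPIDs per Record ID."""
    return dfi.groupby("Record ID")["MAGPID"].nunique()

counts = match_counts(toy)

# between() is inclusive on both ends by default, like ge(1990) & le(2015)
in_window = toy[toy["RetractionYear"].between(1990, 2015)]
window_counts = match_counts(in_window)

print(counts.to_dict())
print(window_counts.to_dict())
```

Counting with `nunique` rather than row counts means that duplicate rows for the same record and MAG paper do not inflate the number of matches.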
    

### Processing Exact Matching (based on DOI or Title)

In [116]:
# Let us explore exact matching first

df = pd.read_csv(PROCESSED_EXACT_PAPER_MATCH_LOCAL_PATH)
df = df.merge(df_rw, on='Record ID')
df.head()

Unnamed: 0,Record ID,RWTitleNorm,OriginalPaperDOI,MAGPID,MAGTitle,MAGPubYear,RetractionYear
0,28505,tet1 exerts its tumor suppressor function by r...,10.1042/BSR20160523,2597493214,tet1 exerts its tumour suppressor function by ...,2017.0,2021.0
1,28498,upregulation of oxidative stress-responsive 1(...,10.1080/21655979.2020.1814659,3081313700,upregulation of oxidative stress responsive 1 ...,2020.0,2021.0
2,28596,the facets of gender inequality and homicide: ...,10.1080/08974454.2019.1632773,2958272459,retracted article the facets of gender inequal...,2019.0,2021.0
3,28351,opioid-sparing effect of modified intercostal ...,10.1097/EJA.0000000000001394,3104158721,opioid sparing effect of modified intercostal ...,2020.0,2021.0
4,28302,beijing as the hub of crf07_bc transmission fr...,10.1089/AID.2020.0147,3112479964,beijing as the hub of crf07_bc transmission fr...,2020.0,2021.0
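
Note that `df.merge(df_rw, on='Record ID')` uses the pandas default inner join, so any match whose `Record ID` is missing from `df_rw` is dropped. A small sketch on made-up data (all values hypothetical) contrasts this with a left join:

```python
import pandas as pd

matches = pd.DataFrame({"Record ID": [1, 2, 3], "MAGPID": [10, 20, 30]})
rw = pd.DataFrame({"Record ID": [1, 2], "RetractionYear": [2010.0, 2021.0]})

# Default how="inner": record 3 has no row in rw and disappears
inner = matches.merge(rw, on="Record ID")

# how="left": record 3 is kept, with a missing RetractionYear
left = matches.merge(rw, on="Record ID", how="left")

print(len(inner), len(left))
```

Comparing the row counts before and after the merge is a quick way to check whether any matches were lost at this step.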


In [117]:
# Extracting records that were matched on DOI (rows where a DOI is present)

matched_doi = df[~df.OriginalPaperDOI.isna()]

extract_record_summary(matched_doi)

Number of unique records: 7906
Average number of matches per record 1.0103718694662283
Max number of matches per record 21
Records with exactly 1 match 7874
Records with more than 1 match 32
###########
Number of unique records retracted between 1990-2015: 5593
Average number of matches per record for 1990-2015 1.0032183086000357
Max number of matches per record for 1990-2015 4
Records with exactly 1 match for 1990-2015 5580
Records with more than 1 match for 1990-2015 13


In [119]:
# Extracting records that were matched on title (because they have no DOI)
matched_title = df[df.OriginalPaperDOI.isna()]
extract_record_summary(matched_title)

Number of unique records: 2687
Average number of matches per record 1.2221808708596948
Max number of matches per record 137
Records with exactly 1 match 2479
Records with more than 1 match 208
###########
Number of unique records retracted between 1990-2015: 1272
Average number of matches per record for 1990-2015 1.3435534591194969
Max number of matches per record for 1990-2015 137
Records with exactly 1 match for 1990-2015 1155
Records with more than 1 match for 1990-2015 117


In [86]:
# Inspecting the record with an unusually large number of title matches
matched_title[matched_title['Record ID']==7568]

Unnamed: 0,Record ID,RWTitleNorm,OriginalPaperDOI,MAGPID,MAGTitle,MAGPubYear,RetractionYear
10645,7568,total quality management,,2623010252,total quality management,2016.0,2009.0
10646,7568,total quality management,,3151953841,total quality management,2007.0,2009.0
10647,7568,total quality management,,3148102789,total quality management,2004.0,2009.0
10648,7568,total quality management,,2622801231,total quality management,2017.0,2009.0
10649,7568,total quality management,,1979626407,total quality management,1990.0,2009.0
...,...,...,...,...,...,...,...
10777,7568,total quality management,,2929231812,total quality management,2018.0,2009.0
10778,7568,total quality management,,2418132854,total quality management,1992.0,2009.0
10779,7568,total quality management,,3144737536,total quality management,2013.0,2009.0
10780,7568,total quality management,,3148878130,total quality management,2006.0,2009.0
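
One way to locate such a record without hard-coding its ID is to take the index of the largest per-record match count. The sketch below uses made-up data; only the `Record ID` and `MAGPID` column names are taken from the notebook:

```python
import pandas as pd

toy = pd.DataFrame({
    "Record ID": [7568, 7568, 7568, 100, 200],
    "MAGPID": [1, 2, 3, 4, 5],
})

# Distinct MAG candidates per record
per_record = toy.groupby("Record ID")["MAGPID"].nunique()

# Record ID with the largest number of candidates, and its rows
worst_id = per_record.idxmax()
worst_rows = toy[toy["Record ID"] == worst_id]

print(worst_id, len(worst_rows))
```

Generic titles such as "total quality management" match many unrelated MAG papers, which is why a record like this one stands out in the summary.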


### Processing Fuzzy Matching

In [87]:
# Initializing the two lists for two ways we did fuzzy matching
dfs_exactyear = []
dfs_fuzzyyear = []


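# Assumption: the exact-match file sits in the fuzzy match directory,
# so its file name is taken from the exact-match path read above
exact_match_fname = os.path.basename(PROCESSED_EXACT_PAPER_MATCH_LOCAL_PATH)

# going through the file list
for fname in flist:
    # Only reading if it is not the exact match file
    if fname != exact_match_fname:
        df = pd.read_csv(os.path.join(FUZZYMATCH_LOCAL_PATH, fname))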
        # If it is exact year fuzzy matching
        if "exact_year" in fname:
            dfs_exactyear.append(df)
        # If it is fuzzy year fuzzy matching
        else:
            dfs_fuzzyyear.append(df)
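
The routing logic of the loop can be checked on its own, without reading any files. In this sketch the function `split_match_files` and the file names are hypothetical; the only rule taken from the notebook is that names containing `"exact_year"` go to the exact-year group:

```python
def split_match_files(filenames, exact_match_fname):
    """Partition file names into exact-year and fuzzy-year groups,
    skipping the exact-match file."""
    exact_year, fuzzy_year = [], []
    for fname in filenames:
        if fname == exact_match_fname:
            continue  # exact matches are processed separately
        if "exact_year" in fname:
            exact_year.append(fname)
        else:
            fuzzy_year.append(fname)
    return exact_year, fuzzy_year

# Hypothetical directory listing
names = ["exact_match.csv", "fuzzy_exact_year_0.csv", "fuzzy_pm1_year_0.csv"]
exact_year_files, fuzzy_year_files = split_match_files(names, "exact_match.csv")

print(exact_year_files)
print(fuzzy_year_files)
```

Because the fuzzy-year group is the `else` branch, any unrelated file in the directory would also end up there, so it is worth checking the listing before concatenating.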

In [113]:
# Processing exact year fuzzy matching
df_exactyear = pd.concat(dfs_exactyear)
df_exactyear = df_exactyear.merge(df_rw, on='Record ID')
df_exactyear.head(2)

Unnamed: 0,MAGTitle,score,index,MAGPID,MAGTitle.1,MAGPubYear,Record ID,RWTitleNorm,RetractionYear
0,vitamin d supplementation affects serum high s...,99.324324,4317227,2112332874,vitamin d supplementation affects serum high s...,2013.0,28372,vitamin d supplementation affects serum high-s...,2021.0
1,vitamin d supplementation affects serum high s...,95.0,3035336,2188639929,vitamin d supplementation affects serum high s...,2013.0,28372,vitamin d supplementation affects serum high-s...,2021.0


In [114]:
# summarizing exact year fuzzy matching
extract_record_summary(df_exactyear)

Number of unique records: 12429
Average number of matches per record 1.1482017861453053
Max number of matches per record 3
Records with exactly 1 match 10779
Records with more than 1 match 1650
###########
Number of unique records retracted between 1990-2015: 8090
Average number of matches per record for 1990-2015 1.1164400494437576
Max number of matches per record for 1990-2015 3
Records with exactly 1 match for 1990-2015 7263
Records with more than 1 match for 1990-2015 827


In [106]:
# processing fuzzy year fuzzy matching
df_fuzzyyear = pd.concat(dfs_fuzzyyear)
df_fuzzyyear = df_fuzzyyear.merge(df_rw, on='Record ID')
# removing those records that already have an exact year fuzzy match
df_fuzzyyear = df_fuzzyyear[~df_fuzzyyear['Record ID'].isin(df_exactyear['Record ID'])]
df_fuzzyyear.head(2)

Unnamed: 0,RWTitleNorm,MAGTitle,score,index,MAGPID,MAGTitle.1,MAGPubYear,Record ID,RetractionYear
0,evaluation of the predictive performance of th...,retracted evaluation of the predictive perform...,95.0,11141080,2806309149,retracted evaluation of the predictive perform...,2019.0,21458,2019.0
2,an in vitro model for quantifying chemical tra...,retracted article an in vitro model for quanti...,95.0,12932717,2806662748,retracted article an in vitro model for quanti...,2019.0,21491,2019.0


In [107]:
# summarizing fuzzy year fuzzy matching
extract_record_summary(df_fuzzyyear)

Number of unique records: 1414
Average number of matches per record 1.2065063649222065
Max number of matches per record 3
Records with exactly 1 match 1154
Records with more than 1 match 260
###########
Number of unique records retracted between 1990-2015: 600
Average number of matches per record for 1990-2015 1.2316666666666667
Max number of matches per record for 1990-2015 3
Records with exactly 1 match for 1990-2015 476
Records with more than 1 match for 1990-2015 124


In [108]:
# Now let us read retraction notices and remove them from matches
df_retraction_notices = pd.read_csv(RETRACTION_NOTICES_LOCAL_PATH)
df_retraction_notices.head()

Unnamed: 0,PID,PaperTitle,OriginalTitle,PubYear,DocSubTypes
0,1591961241,notice of retraction a method of multi dimensi...,Notice of Retraction A method of multi-dimensi...,2010,Retraction Notice
1,1868656167,notice of retraction effect of asparagus polys...,Notice of Retraction Effect of Asparagus polys...,2010,Retraction Notice
2,2042025123,retraction note to therapeutic effects of metf...,Retraction Note to: Therapeutic effects of met...,2016,Retraction Notice
3,2050825760,retraction note to aging decreases rate of doc...,Retraction Note to: Aging decreases rate of do...,2013,Retraction Notice
4,2770316826,notice of retraction a study of the eye catchi...,Notice of Retraction A study of the eye-catchi...,2017,Retraction Notice


In [109]:
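# Selecting the exact-year matches whose MAG paper is a retraction notice.
# Note: isin() without negation keeps only the notices; removing them
# from the matches requires the negated mask, ~df_exactyear['MAGPID'].isin(...)
df_exactyear_filtered = df_exactyear[df_exactyear['MAGPID'].isin(df_retraction_notices['PID'])]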
extract_record_summary(df_exactyear_filtered)

Number of unique records: 5653
Average number of matches per record 1.0019458694498495
Max number of matches per record 2
Records with exactly 1 match 5642
Records with more than 1 match 11
###########
Number of unique records retracted between 1990-2015: 5032
Average number of matches per record for 1990-2015 1.0015898251192368
Max number of matches per record for 1990-2015 2
Records with exactly 1 match for 1990-2015 5024
Records with more than 1 match for 1990-2015 8


In [110]:
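# Same selection for the fuzzy-year matches: isin() without negation keeps
# only the retraction notices; negate the mask with ~ to remove them instead
df_fuzzyyear_filtered = df_fuzzyyear[df_fuzzyyear['MAGPID'].isin(df_retraction_notices['PID'])]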
extract_record_summary(df_fuzzyyear_filtered)

Number of unique records: 185
Average number of matches per record 1.0054054054054054
Max number of matches per record 2
Records with exactly 1 match 184
Records with more than 1 match 1
###########
Number of unique records retracted between 1990-2015: 80
Average number of matches per record for 1990-2015 1.0125
Max number of matches per record for 1990-2015 2
Records with exactly 1 match for 1990-2015 79
Records with more than 1 match for 1990-2015 1
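
The stated aim of this step is to remove retraction notices from the matches. Since `isin` returns a boolean mask, the notices and the remaining matches are the two halves of the same mask. A minimal sketch on made-up data (all IDs below are hypothetical):

```python
import pandas as pd

matches = pd.DataFrame({"Record ID": [1, 1, 2], "MAGPID": [100, 101, 200]})
notices = pd.DataFrame({"PID": [101]})

# True where the matched MAG paper is a retraction notice
is_notice = matches["MAGPID"].isin(notices["PID"])

only_notices = matches[is_notice]      # matches that ARE notices
without_notices = matches[~is_notice]  # matches with notices removed

print(only_notices["MAGPID"].tolist())
print(without_notices["MAGPID"].tolist())
```

Checking that the two halves add up to the original number of rows is a cheap guard against applying the mask the wrong way round.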


In [62]:
# Distribution of the number of distinct MAG matches per record (fuzzy year)
df_fuzzyyear.groupby('Record ID')['MAGPID'].nunique().describe()

count    1414.000000
mean        1.206506
std         0.457462
min         1.000000
25%         1.000000
50%         1.000000
75%         1.000000
max         3.000000
Name: MAGPID, dtype: float64