## Processing Paper Matches

In this notebook, we process the RW-MAG (Retraction Watch to Microsoft Academic Graph) paper matches and consolidate the matches obtained under different criteria into a single CSV file. We also remove retraction notices from the MAG matches.

In [71]:
# Importing relevant packages

import pandas as pd
import os
from config_reader import read_config

In [72]:
# Reading paths
paths = read_config()
FUZZYMATCH_LOCAL_PATH = paths['FUZZYMATCH_LOCAL_PATH']
RW_ORIGINAL_W_YEAR_LOCAL_PATH = paths['RW_ORIGINAL_W_YEAR_LOCAL_PATH']
PROCESSED_EXACT_PAPER_MATCH_LOCAL_PATH = paths['PROCESSED_EXACT_PAPER_MATCH_LOCAL_PATH']

In [75]:
# Reading list of all files in fuzzy match directory
flist = os.listdir(FUZZYMATCH_LOCAL_PATH)

In [76]:
# Reading retraction watch dataset
df_rw = pd.read_csv(RW_ORIGINAL_W_YEAR_LOCAL_PATH, usecols=['Record ID','RetractionYear'])
df_rw.head()

Unnamed: 0,Record ID,RetractionYear
0,28599,2021.0
1,28504,2021.0
2,28506,2021.0
3,28505,2021.0
4,28502,2021.0


### Processing Exact Matching (based on DOI or Title)

In [77]:
# Let us explore exact matching first

df = pd.read_csv(PROCESSED_EXACT_PAPER_MATCH_LOCAL_PATH)
df = df.merge(df_rw, on='Record ID')
df.head()

Unnamed: 0,Record ID,RWTitleNorm,OriginalPaperDOI,MAGPID,MAGTitle,MAGPubYear,RetractionYear
0,28505,tet1 exerts its tumor suppressor function by r...,10.1042/BSR20160523,2597493214,tet1 exerts its tumour suppressor function by ...,2017.0,2021.0
1,28498,upregulation of oxidative stress-responsive 1(...,10.1080/21655979.2020.1814659,3081313700,upregulation of oxidative stress responsive 1 ...,2020.0,2021.0
2,28596,the facets of gender inequality and homicide: ...,10.1080/08974454.2019.1632773,2958272459,retracted article the facets of gender inequal...,2019.0,2021.0
3,28351,opioid-sparing effect of modified intercostal ...,10.1097/EJA.0000000000001394,3104158721,opioid sparing effect of modified intercostal ...,2020.0,2021.0
4,28302,beijing as the hub of crf07_bc transmission fr...,10.1089/AID.2020.0147,3112479964,beijing as the hub of crf07_bc transmission fr...,2020.0,2021.0


In [78]:
# Extracting records matched on DOI (rows with a non-null OriginalPaperDOI)

matched_doi = df[~df.OriginalPaperDOI.isna()]
matched_doi['Record ID'].nunique()

7906

In [80]:
# Counting DOI-matched records retracted between 1990 and 2015 (inclusive)

matched_doi[matched_doi['RetractionYear'].le(2015) & matched_doi['RetractionYear'].ge(1990)]\
        ['Record ID'].nunique()

5593
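
The inclusive year filter above can also be written with `Series.between`, which is inclusive on both ends by default and treats missing years as outside the window. The following is a minimal sketch on toy data; the frame `toy` and its values are illustrative assumptions, and only the column names mirror those used in this notebook.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the real dataset)
toy = pd.DataFrame({
    "Record ID": [1, 1, 2, 3, 4],
    "RetractionYear": [1995.0, 1995.0, 2015.0, 2016.0, None],
})

# between() is inclusive on both ends by default; NaN years evaluate to False
in_window = toy["RetractionYear"].between(1990, 2015)

# Count unique records, not rows, since a record can have several matches
n_records = toy.loc[in_window, "Record ID"].nunique()
print(n_records)
```

Counting with `nunique()` on `Record ID` rather than taking the number of rows matters here, because a single record may appear once per candidate match.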

In [81]:
# Checking the number of matches per record
matched_doi[['Record ID','MAGPID']].groupby('Record ID')['MAGPID'].nunique().describe()
# Most records have exactly 1 match; a few have more (up to 21)

count    7906.000000
mean        1.010372
std         0.331204
min         1.000000
25%         1.000000
50%         1.000000
75%         1.000000
max        21.000000
Name: MAGPID, dtype: float64

In [82]:
# Extracting records that were matched on title (because they have no DOI)
matched_title = df[df.OriginalPaperDOI.isna()]
matched_title['Record ID'].nunique()

2687

In [83]:
# Counting title-matched records retracted between 1990 and 2015 (inclusive)
matched_title[matched_title['RetractionYear'].le(2015) & matched_title['RetractionYear'].ge(1990)]\
        ['Record ID'].nunique()

1272

In [85]:
# Also counting the number of matches per record.

matched_title[['Record ID','MAGPID']].groupby('Record ID')['MAGPID'].nunique().describe()
# Most records have 1 match, but 208 have more than 1 match.

count    2687.000000
mean        1.222181
std         3.533815
min         1.000000
25%         1.000000
50%         1.000000
75%         1.000000
max       137.000000
Name: MAGPID, dtype: float64
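
`describe()` summarizes the distribution but does not directly report how many records have more than one match. A count like the one quoted in the comment above can be obtained by thresholding the per-record counts. This is a small sketch on toy data; `toy` and its IDs are hypothetical, and only the column names follow the notebook.

```python
import pandas as pd

# Toy data (hypothetical IDs); record 10 has a duplicated MAGPID on purpose
toy = pd.DataFrame({
    "Record ID": [10, 10, 10, 11, 12, 12],
    "MAGPID": [100, 101, 101, 200, 300, 301],
})

# Number of distinct MAG papers matched to each record
matches_per_record = toy.groupby("Record ID")["MAGPID"].nunique()

# Number of records with more than one distinct match
n_multi = int((matches_per_record > 1).sum())
print(n_multi)
```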

In [86]:
# Inspecting a record with an unusually large number of title matches
matched_title[matched_title['Record ID']==7568]

Unnamed: 0,Record ID,RWTitleNorm,OriginalPaperDOI,MAGPID,MAGTitle,MAGPubYear,RetractionYear
10645,7568,total quality management,,2623010252,total quality management,2016.0,2009.0
10646,7568,total quality management,,3151953841,total quality management,2007.0,2009.0
10647,7568,total quality management,,3148102789,total quality management,2004.0,2009.0
10648,7568,total quality management,,2622801231,total quality management,2017.0,2009.0
10649,7568,total quality management,,1979626407,total quality management,1990.0,2009.0
...,...,...,...,...,...,...,...
10777,7568,total quality management,,2929231812,total quality management,2018.0,2009.0
10778,7568,total quality management,,2418132854,total quality management,1992.0,2009.0
10779,7568,total quality management,,3144737536,total quality management,2013.0,2009.0
10780,7568,total quality management,,3148878130,total quality management,2006.0,2009.0


### Processing Fuzzy Matching

In [87]:
# Initializing the two lists for two ways we did fuzzy matching
dfs_exactyear = []
dfs_fuzzyyear = []
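
# Assumption: the exact-match file sits in the fuzzy match directory under
# the same base name as PROCESSED_EXACT_PAPER_MATCH_LOCAL_PATH
exact_match_fname = os.path.basename(PROCESSED_EXACT_PAPER_MATCH_LOCAL_PATH)

# Going through the file list
for fname in flist:
    # Skipping the exact-match file
    if fname != exact_match_fname:
        df = pd.read_csv(os.path.join(FUZZYMATCH_LOCAL_PATH, fname))
        # If it is exact year fuzzy matching
        if "exact_year" in fname:
            dfs_exactyear.append(df)
        # If it is fuzzy year fuzzy matching
        else:
            dfs_fuzzyyear.append(df)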

In [None]:
def extract_record_summary(dfi):
    """
    Print a summary of the matches in the dataframe (dfi):
    (a) the number of unique records,
    (b) the number of unique records retracted between 1990-2015,
    (c) the average number of matches per record for (a) and (b),
    (d) the max number of matches per record for (a) and (b), and
    (e) the number of records with more than 1 match for (a) and (b).
    """
    # Distinct MAG papers matched to each record, over all years
    matches = dfi.groupby('Record ID')['MAGPID'].nunique()
    print("Number of unique records:", dfi['Record ID'].nunique())
    print("Average number of matches per record:", matches.mean())
    print("Max number of matches per record:", matches.max())
    print("Number of records with more than 1 match:", (matches > 1).sum())

    print("###########")
    # Same statistics, restricted to records retracted between 1990-2015
    records_1990_2015 = dfi[dfi['RetractionYear'].ge(1990) & dfi['RetractionYear'].le(2015)]
    matches_1990_2015 = records_1990_2015.groupby('Record ID')['MAGPID'].nunique()
    print("Number of unique records retracted between 1990-2015:",
          records_1990_2015['Record ID'].nunique())
    print("Average number of matches per record for 1990-2015:",
          matches_1990_2015.mean())
    print("Max number of matches per record for 1990-2015:",
          matches_1990_2015.max())
    print("Number of records with more than 1 match for 1990-2015:",
          (matches_1990_2015 > 1).sum())

In [88]:
# Processing exact year fuzzy matching
df_exactyear = pd.concat(dfs_exactyear)
df_exactyear = df_exactyear.merge(df_rw, on='Record ID')
df_exactyear.head(2)


Unnamed: 0,MAGTitle,score,index,MAGPID,MAGTitle.1,MAGPubYear,Record ID,RWTitleNorm,RetractionYear
0,vitamin d supplementation affects serum high s...,99.324324,4317227,2112332874,vitamin d supplementation affects serum high s...,2013.0,28372,vitamin d supplementation affects serum high-s...,2021.0
1,vitamin d supplementation affects serum high s...,95.0,3035336,2188639929,vitamin d supplementation affects serum high s...,2013.0,28372,vitamin d supplementation affects serum high-s...,2021.0


In [53]:

# Number of unique records with an exact year fuzzy match
df_exactyear['Record ID'].nunique()

12429

In [54]:
df_exactyear[df_exactyear['RetractionYear'].ge(1990) & df_exactyear['RetractionYear'].le(2015)]['Record ID'].nunique()

8090

In [55]:
df_exactyear.groupby(['Record ID'])['MAGPID'].nunique().describe()

count    12429.000000
mean         1.148202
std          0.396417
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          3.000000
Name: MAGPID, dtype: float64

In [58]:
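# Processing fuzzy year fuzzy matching
df_fuzzyyear = pd.concat(dfs_fuzzyyear)
df_fuzzyyear = df_fuzzyyear.merge(df_rw, on='Record ID')
# Keeping only records not already matched by exact year fuzzy matching
df_fuzzyyear = df_fuzzyyear[~df_fuzzyyear['Record ID'].isin(df_exactyear['Record ID'])]
df_fuzzyyear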

Unnamed: 0,RWTitleNorm,MAGTitle,score,index,MAGPID,MAGTitle.1,MAGPubYear,Record ID,RetractionYear
0,evaluation of the predictive performance of th...,retracted evaluation of the predictive perform...,95.000000,11141080,2806309149,retracted evaluation of the predictive perform...,2019.0,21458,2019.0
2,an in vitro model for quantifying chemical tra...,retracted article an in vitro model for quanti...,95.000000,12932717,2806662748,retracted article an in vitro model for quanti...,2019.0,21491,2019.0
3,advances in diagnosis of endometrial hyperplasia,retracted article advances in diagnosis of end...,95.000000,4419696,2793142489,retracted article advances in diagnosis of end...,2019.0,21563,2019.0
4,vitamin d supplementation affects the beck dep...,vitamin d supplementation affects the beck dep...,99.004975,16028362,2177838021,vitamin d supplementation affects the beck dep...,2016.0,28440,2021.0
5,the highly efficient adsorption of pb(ii) on g...,the highly efficient adsorption of pb ii on gr...,99.193548,1573833,2218482279,the highly efficient adsorption of pb ii on gr...,2016.0,27717,2021.0
...,...,...,...,...,...,...,...,...,...
3135,acclimation of 2-chlorophenol-biodegrading act...,acclimation of 2 chlorophenol biodegrading act...,100.000000,14532241,2903597653,acclimation of 2 chlorophenol biodegrading act...,2018.0,20722,2019.0
3136,optimization of planning and design of urban s...,retracted optimization of planning and design ...,95.000000,8610842,2802356270,retracted optimization of planning and design ...,2018.0,20410,2019.0
3140,can baseline endocrinological examination and ...,retracted article can baseline endocrinologica...,95.000000,14919396,2921607107,retracted article can baseline endocrinologica...,2020.0,21361,2019.0
3141,iot based prudent automatic falls detection fo...,retracted article iot based prudent automatic ...,95.000000,7339830,2922321837,retracted article iot based prudent automatic ...,2020.0,28354,2019.0
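
The `isin` filter in the cell above acts as an anti-join: it keeps only fuzzy year rows whose `Record ID` has no exact year match. Because boolean filtering preserves the original row labels, the index of the result has gaps, as in the output above. A minimal sketch on toy data follows; the frames `exact` and `fuzzy` and their values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the exact year and fuzzy year match tables (hypothetical)
exact = pd.DataFrame({"Record ID": [1, 2, 2]})
fuzzy = pd.DataFrame({
    "Record ID": [2, 3, 3, 4],
    "MAGPID": [20, 30, 31, 40],
})

# Anti-join: drop fuzzy rows whose record already appears in the exact table
remaining = fuzzy[~fuzzy["Record ID"].isin(exact["Record ID"])]

# Sanity check: no record should be present in both tables afterwards
overlap = set(remaining["Record ID"]) & set(exact["Record ID"])

print(remaining.index.tolist())  # original row labels are kept, so gaps appear
print(overlap)
```

If contiguous row labels are preferred, `remaining.reset_index(drop=True)` renumbers the rows without changing their content.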


In [59]:
df_fuzzyyear['Record ID'].nunique()

1414

In [61]:
df_fuzzyyear[df_fuzzyyear['RetractionYear'].ge(1990) & df_fuzzyyear['RetractionYear'].le(2015)]['Record ID'].nunique()

600

In [62]:
df_fuzzyyear.groupby('Record ID')['MAGPID'].nunique().describe()

count    1414.000000
mean        1.206506
std         0.457462
min         1.000000
25%         1.000000
50%         1.000000
75%         1.000000
max         3.000000
Name: MAGPID, dtype: float64