## Processing Paper Matches

In this notebook, we process the RW-MAG paper matches and consolidate the matches obtained under different criteria into a single CSV file. We also remove retraction notices from the MAG matches.

## The process of filtering records and matches


0. **Remove retraction notices in RW**
1. **Remove all the bulk retractions**
2. **Remove all records with duplicate titles in RW**
3. **Remove all records retracted outside 1990-2015**
4. Paper Matching
       a. **Exact DOI**
       b. **Exact Title**
       c. Exact year, fuzzy title
           c1. Keep all records whose retraction DOI is the same as the original paper DOI
           c2. Matches to hard code: MAGPID: 2418262483 for 4465 and 3011105395 for 24881
           c3. Remove all matches that have "Retraction Note" or "Retraction Notice" in the title
           c4. Keep all records whose matched publication year is the same as the retraction year
       d. Fuzzy year, fuzzy title
           d1. Keep all records whose retraction DOI is the same as the original paper DOI
           d3. Remove all matches that have "Retraction Note" or "Retraction Notice" in the title
           d4. Keep all records whose matched publication year is the same as the retraction year
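
The precedence among the matching stages in step 4 can be sketched as follows. This is a minimal illustration, assuming that a record takes its matches from the first stage (a through d) that produced any. The `stage` column, the toy frames, and the helper `consolidate_matches` are hypothetical names and are not taken from this notebook.

```python
import pandas as pd

def consolidate_matches(stages):
    """Concatenate per-stage match tables, keeping each record only in
    the earliest stage where it appears (hypothetical helper)."""
    kept, seen = [], set()
    for name, df in stages:
        # Drop records already matched by an earlier, stricter stage
        new = df[~df['Record ID'].isin(seen)].assign(stage=name)
        kept.append(new)
        seen.update(new['Record ID'])
    return pd.concat(kept, ignore_index=True)

# Toy data: record 2 appears in both the DOI stage and the fuzzy stage
doi = pd.DataFrame({'Record ID': [1, 2], 'MAGPID': [10, 20]})
fuzzy = pd.DataFrame({'Record ID': [2, 3], 'MAGPID': [21, 30]})
consolidated = consolidate_matches([('exact_doi', doi), ('exact_year_fuzzy', fuzzy)])
print(consolidated)
```

In the toy data, record 2 keeps only its DOI-stage match, and record 3 is picked up by the fuzzy stage.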

In [1]:
# Importing relevant packages

import pandas as pd
import os
from config_reader import read_config

In [2]:
# Reading paths
paths = read_config()
FUZZYMATCH_LOCAL_PATH = paths['FUZZYMATCH_LOCAL_PATH']
RW_ORIGINAL_W_YEAR_LOCAL_PATH = paths['RW_ORIGINAL_W_YEAR_LOCAL_PATH']
PROCESSED_EXACT_PAPER_MATCH_LOCAL_PATH = paths['PROCESSED_EXACT_PAPER_MATCH_LOCAL_PATH']
RETRACTION_NOTICES_LOCAL_PATH = paths['RETRACTION_NOTICES_LOCAL_PATH']

In [3]:
# Reading list of all files in fuzzy match directory
flist = os.listdir(FUZZYMATCH_LOCAL_PATH)

In [4]:
# Reading retraction watch dataset
df_rw = pd.read_csv(RW_ORIGINAL_W_YEAR_LOCAL_PATH, usecols=['Record ID', 'Title', 'RetractionDate' , 
                                                            'RetractionYear',
                                                           'RetractionDOI','OriginalPaperDOI',
                                                            'RetractionPubMedID', 'OriginalPaperPubMedID',
                                                           'Journal', 'ArticleType', 'Reason'])
df_rw.head(1)

Unnamed: 0,Record ID,Title,Journal,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDOI,OriginalPaperPubMedID,Reason,RetractionYear
0,28599,TWEAK-Fn14 Influences Neurogenesis Status via ...,Molecular Neurobiology,Research Article;,2021-05-15,10.1007/s12035-021-02424-8,33991321,10.1007/s12035-016-0248-x,27822714,+Duplication of Image;+Unreliable Data;,2021.0


# 0. Identifying Retraction Notices in RW

In [5]:
# Extracting all records whose ArticleType contains the keyword "Retraction Notice" (very few records)
records_removed_notices_from_RW = df_rw[~df_rw['ArticleType'].isna() & df_rw['ArticleType']\
                                        .str.contains('Retraction Notice')]['Record ID'].unique()

print(f"Number of records identified as retraction notices in RW: {len(records_removed_notices_from_RW)}")

Number of records identified as retraction notices in RW: 6
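
As a small aside on the mask above: `Series.str.contains` accepts an `na` argument, so the explicit `isna()` guard can be folded into the call. The sketch below uses made-up `ArticleType` values to show that both forms select the same rows.

```python
import pandas as pd

toy = pd.DataFrame({
    'Record ID': [1, 2, 3],
    'ArticleType': ['Research Article;', 'Retraction Notice;', None],
})

# Form used in the notebook: guard against missing values, then match the keyword
mask_guarded = ~toy['ArticleType'].isna() & toy['ArticleType'].str.contains('Retraction Notice')

# Alternative form: treat missing values as non-matches inside the call
mask_na_false = toy['ArticleType'].str.contains('Retraction Notice', na=False)

print(toy.loc[mask_na_false, 'Record ID'].tolist())
```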


# 1. Identifying bulk retractions

In [6]:
# Identify bulk retractions: 5 or more records retracted by the same journal on the same date
df_rw_bulkCounts = df_rw.groupby(['Journal','RetractionDate'])['Record ID'].nunique().reset_index()\
                            .rename(columns={'Record ID':'bulkCounts'})

# merging with actual RW 
df_rw_temp = df_rw.merge(df_rw_bulkCounts, on=['Journal','RetractionDate'], how='left')

# Records to be removed due to bulk retractions
records_removed_bulk_retractions = df_rw_temp[df_rw_temp['bulkCounts'].ge(5)]['Record ID'].unique()

print(f"Number of records identified in bulk retractions: {len(records_removed_bulk_retractions)}")

Number of records identified in bulk retractions: 11357
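
The count-then-merge pattern above can also be written with `groupby(...).transform('nunique')`, which attaches the per-group count to each row without a separate merge. A minimal sketch on made-up data follows; the threshold is lowered from 5 to 2 only to keep the toy table small.

```python
import pandas as pd

toy = pd.DataFrame({
    'Record ID': [1, 2, 3, 4],
    'Journal': ['J1', 'J1', 'J1', 'J2'],
    'RetractionDate': ['2010-01-01', '2010-01-01', '2011-05-05', '2010-01-01'],
})

# Number of distinct records retracted by the same journal on the same date
toy['bulkCounts'] = toy.groupby(['Journal', 'RetractionDate'])['Record ID'].transform('nunique')

threshold = 2  # the notebook uses 5; lowered here for the toy data
bulk_ids = toy.loc[toy['bulkCounts'].ge(threshold), 'Record ID'].unique()
print(sorted(bulk_ids))
```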


In [7]:
# printing the difference

len(set(records_removed_bulk_retractions)-set(records_removed_notices_from_RW))

11356

# 2. Identifying duplicate title records

In [8]:
# Extracting records with duplicate titles in RW
records_removed_duplicate_titles = df_rw[df_rw['Title'].duplicated(keep=False)]\
                                        .sort_values(by=['Title'])['Record ID'].unique()

print(f"Number of records identified in duplicate titles: {len(records_removed_duplicate_titles)}")

Number of records identified in duplicate titles: 123


In [9]:
# printing the difference

len(set(records_removed_duplicate_titles)-set(records_removed_bulk_retractions)-set(records_removed_notices_from_RW))

108
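
`duplicated(keep=False)` flags every member of a duplicate group, not just the later copies, which is why all records sharing a title are marked for removal rather than all but one. A toy illustration with made-up titles:

```python
import pandas as pd

titles = pd.Series(['A study', 'B study', 'A study'], index=[101, 102, 103])

# Default keep='first' leaves the first occurrence unflagged
first_kept = titles.duplicated()
# keep=False flags every occurrence of a repeated title
all_flagged = titles.duplicated(keep=False)

print(first_kept.tolist())
print(all_flagged.tolist())
```

The comparison is an exact string match, so titles that differ only in case or punctuation are not treated as duplicates.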

# 3. Identifying records outside 1990-2015

In [10]:
# extracting records that are not in our window
records_removed_1990_2015 = df_rw[df_rw['RetractionYear'].lt(1990) | 
                                 df_rw['RetractionYear'].gt(2015)]\
                                        ['Record ID'].unique()

print(f"Number of records identified beyond 1990-2015: {len(records_removed_1990_2015)}")

Number of records identified beyond 1990-2015: 10321
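
A detail worth knowing about the window filter above: comparisons against `NaN` evaluate to `False`, so a record with a missing `RetractionYear` is not flagged by `lt(1990) | gt(2015)`. The toy example below contrasts this with `~between(1990, 2015)`, which does flag missing years. Which behaviour is wanted depends on the analysis; the values here are made up.

```python
import numpy as np
import pandas as pd

years = pd.Series([1985.0, 2000.0, 2021.0, np.nan])

# Form used in the notebook: a missing year is never "outside"
outside_cmp = years.lt(1990) | years.gt(2015)

# Alternative: anything not inside the window, including missing years
outside_between = ~years.between(1990, 2015)

print(outside_cmp.tolist())
print(outside_between.tolist())
```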


In [11]:
# looking at the difference in the number of records due to this specific filter
len(set(records_removed_1990_2015)-set(records_removed_duplicate_titles)\
        -set(records_removed_bulk_retractions)-set(records_removed_notices_from_RW))

8330

# * Creating big list of records removed so far

In [12]:
# Finally creating a big list of records that we just have to remove due to filtering process
records_filtered = set(list(records_removed_1990_2015)+list(records_removed_duplicate_titles)\
        +list(records_removed_bulk_retractions)+list(records_removed_notices_from_RW))

print(f"Number of records removed due to the above filters {len(records_filtered)}")

Number of records removed due to the above filters 19800
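
The "difference" cells above repeat one pattern: count the records a filter removes that no earlier filter already removed. A small hypothetical helper (`incremental_counts` is not part of this notebook) makes that pattern explicit on made-up IDs:

```python
def incremental_counts(filters):
    """For an ordered list of (name, ids) pairs, return how many new IDs
    each filter contributes beyond all filters before it."""
    seen, counts = set(), {}
    for name, ids in filters:
        new_ids = set(ids) - seen
        counts[name] = len(new_ids)
        seen |= new_ids
    return counts

counts = incremental_counts([
    ('notices', [1, 2]),
    ('bulk', [2, 3, 4]),
    ('duplicates', [4, 5]),
])
print(counts)
```

The incremental counts always sum to the size of the union. This agrees with the numbers reported above: 6 + 11356 + 108 + 8330 = 19800.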


# 4. Paper Matching

In [248]:
def extract_record_summary(dfi):
    """
    Print a summary of the matches in the dataframe `dfi`, first for all
    records and then for records retracted between 1990 and 2015:
    (a) the number of unique records,
    (b) the average number of matches per record,
    (c) the maximum number of matches per record, and
    (d) the number of records with exactly one match and with more than one match.
    """
    
    print("Number of unique records:", dfi['Record ID'].nunique())
    print("Average number of matches per record", 
          dfi.groupby(['Record ID'])['MAGPID'].nunique().mean())
    print("Max number of matches per record", 
          dfi.groupby(['Record ID'])['MAGPID'].nunique().max())
    dfj = dfi.groupby(['Record ID'])['MAGPID'].nunique().reset_index()\
                .rename(columns={'MAGPID':'NumMatches'})
    print("Records with exactly 1 match", 
          dfj[dfj['NumMatches'].eq(1)]['Record ID'].nunique())
    print("Records with more than 1 match",
         dfj[dfj['NumMatches'].gt(1)]['Record ID'].nunique())
    
    
    print("###########")
    records_1990_2015 = dfi[dfi['RetractionYear'].ge(1990) & dfi['RetractionYear'].le(2015)]
    print("Number of unique records retracted between 1990-2015:", 
          records_1990_2015['Record ID'].nunique())
    print("Average number of matches per record for 1990-2015", 
          records_1990_2015.groupby(['Record ID'])['MAGPID'].nunique().mean())
    print("Max number of matches per record for 1990-2015", 
          records_1990_2015.groupby(['Record ID'])['MAGPID'].nunique().max())
    dfk = records_1990_2015.groupby(['Record ID'])['MAGPID'].nunique().reset_index()\
                .rename(columns={'MAGPID':'NumMatches'})
    print("Records with exactly 1 match for 1990-2015", 
          dfk[dfk['NumMatches'].eq(1)]['Record ID'].nunique())
    print("Records with more than 1 match for 1990-2015",
         dfk[dfk['NumMatches'].gt(1)]['Record ID'].nunique())
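
A variant that returns the statistics instead of printing them can be easier to compare across matching stages. The sketch below is an illustrative alternative and not the function used in this notebook; `match_summary` is a hypothetical name and the data are made up.

```python
import pandas as pd

def match_summary(dfi):
    """Return per-record match statistics as a dict."""
    n_matches = dfi.groupby('Record ID')['MAGPID'].nunique()
    return {
        'records': int(n_matches.size),
        'mean_matches': float(n_matches.mean()),
        'max_matches': int(n_matches.max()),
        'single_match': int(n_matches.eq(1).sum()),
        'multi_match': int(n_matches.gt(1).sum()),
    }

toy = pd.DataFrame({'Record ID': [1, 1, 2, 3], 'MAGPID': [10, 11, 20, 30]})
summary = match_summary(toy)
print(summary)
```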
    

## a. and b. Processing Exact Matching (based on DOI or Title)

In [257]:
# Let us explore exact matching first

# Reading papers that were matched by exact matching
df = pd.read_csv(PROCESSED_EXACT_PAPER_MATCH_LOCAL_PATH)
df = df.merge(df_rw.drop(columns=['OriginalPaperDOI']), on='Record ID')
df.head(1)

Unnamed: 0,Record ID,RWTitleNorm,OriginalPaperDOI,MAGPID,MAGTitle,MAGPubYear,Title,Journal,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperPubMedID,Reason,RetractionYear
0,28505,tet1 exerts its tumor suppressor function by regulating autophagy in glioma cells,10.1042/BSR20160523,2597493214,tet1 exerts its tumour suppressor function by regulating autophagy in glioma cells,2017.0,TET1 exerts its tumor suppressor function by regulating autophagy in glioma cells,Bioscience Reports,Research Article;,2021-05-14,10.1042/BSR-20160523_RET,33988682,28341638,+Concerns/Issues About Data;+Duplication of Image;+Upgrade/Update of Prior Notice;,2021.0


In [258]:
# Extracting those that were matched on DOI, i.e. their OriginalPaperDOI is not NaN
matched_doi = df[~df['OriginalPaperDOI'].isna()]

# Let us remove all the records that were filtered
matched_doi = matched_doi[~matched_doi['Record ID'].isin(records_filtered)]

extract_record_summary(matched_doi)

Number of unique records: 1625
Average number of matches per record 1.0104615384615385
Max number of matches per record 4
Records with exactly 1 match 1613
Records with more than 1 match 12
###########
Number of unique records retracted between 1990-2015: 1625
Average number of matches per record for 1990-2015 1.0104615384615385
Max number of matches per record for 1990-2015 4
Records with exactly 1 match for 1990-2015 1613
Records with more than 1 match for 1990-2015 12


In [261]:
# Extracting records that were matched on title (because they have no DOI)
matched_title = df[df.OriginalPaperDOI.isna()]

# Let us remove all the records that were filtered
matched_title = matched_title[~matched_title['Record ID'].isin(records_filtered)]

#matched_title = matched_title[matched_title['Record ID'].isin(matched_doi['Record ID'].unique())]

extract_record_summary(matched_title)

Number of unique records: 1021
Average number of matches per record 1.1547502448579823
Max number of matches per record 20
Records with exactly 1 match 926
Records with more than 1 match 95
###########
Number of unique records retracted between 1990-2015: 1021
Average number of matches per record for 1990-2015 1.1547502448579823
Max number of matches per record for 1990-2015 20
Records with exactly 1 match for 1990-2015 926
Records with more than 1 match for 1990-2015 95


# c. and d. Processing Fuzzy Matching

In [263]:
# Initializing the two lists for two ways we did fuzzy matching
dfs_exactyear = []
dfs_fuzzyyear = []


# going through the file list
for fname in flist:
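    # Only reading if it is not the exact-match file.
    # NOTE: exact_match_fname is not defined in the cells shown here; it must hold that file's name.
    if fname != exact_match_fname: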
        df = pd.read_csv(FUZZYMATCH_LOCAL_PATH+fname)
        # If it is exact year fuzzy matching
        if "exact_year" in fname:
            dfs_exactyear.append(df)
        # If it is fuzzy year fuzzy matching
        else:
            dfs_fuzzyyear.append(df)
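
A side note on the loop above: building a path as `FUZZYMATCH_LOCAL_PATH+fname` only works when the directory string ends with a separator, whereas `os.path.join` handles both cases. The sketch below routes file names the same way the loop does; the file names and the helper `split_by_strategy` are made up for illustration.

```python
import os

def split_by_strategy(fnames, skip=()):
    """Partition file names into exact-year and fuzzy-year groups."""
    exact_year, fuzzy_year = [], []
    for fname in fnames:
        if fname in skip:
            continue
        (exact_year if 'exact_year' in fname else fuzzy_year).append(fname)
    return exact_year, fuzzy_year

names = ['exact_match.csv', 'exact_year_part1.csv', 'fuzzy_year_part1.csv']
exact_year, fuzzy_year = split_by_strategy(names, skip={'exact_match.csv'})
print(exact_year, fuzzy_year)

# os.path.join gives the same path with or without a trailing separator
same_path = os.path.join('matches', 'a.csv') == os.path.join('matches' + os.sep, 'a.csv')
print(same_path)
```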

## c. Processing exact year fuzzy matching

In [270]:
# Processing exact year fuzzy matching
df_exactyear = pd.concat(dfs_exactyear)
df_exactyear = df_exactyear.merge(df_rw, on='Record ID')

# Removing records that were filtered
df_exactyear = df_exactyear[~df_exactyear['Record ID'].isin(records_filtered)]

df_exactyear.head(2)

Unnamed: 0,MAGTitle,score,index,MAGPID,MAGTitle.1,MAGPubYear,Record ID,RWTitleNorm,Title,Journal,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDOI,OriginalPaperPubMedID,Reason,RetractionYear
76,retracted use of upper triangular matrix tracking for complexity reduction in a linear zf mimo system,95.0,7148009,2028375193,retracted use of upper triangular matrix tracking for complexity reduction in a linear zf mimo system,2013.0,18798,use of upper triangular matrix tracking for complexity reduction in a linear zf mimo system,Use of upper triangular matrix tracking for complexity reduction in a linear ZF MIMO system,Signal Processing,Research Article;,2013-09-01,10.1016/j.sigpro.2013.01.002,0,10.1016/j.sigpro.2013.01.002,0,+Date of Retraction/Other Unknown;+Plagiarism of Article;,2013.0
198,retracted gamma glutamyl transferase activity in kids born from goats fed genetically modified soybean,95.0,7380434,2003800286,retracted gamma glutamyl transferase activity in kids born from goats fed genetically modified soybean,2013.0,9165,gamma-glutamyl transferase activity in kids born from goats fed genetically modified soybean,Gamma-Glutamyl Transferase Activity in Kids Born from Goats Fed Genetically Modified Soybean,Food and Nutrition Sciences,Research Article;,2015-12-15,10.4236/fns.2013.46A006,0,10.4236/fns.2013.46A006,0,+Falsification/Fabrication of Data;+Falsification/Fabrication of Results;+Investigation by Company/Institution;+Investigation by Journal/Publisher;+Misconduct by Author;+Objections by Third Party;,2015.0


In [271]:
# summarizing exact year fuzzy matching
extract_record_summary(df_exactyear)

Number of unique records: 3137
Average number of matches per record 1.1976410583359898
Max number of matches per record 3
Records with exactly 1 match 2596
Records with more than 1 match 541
###########
Number of unique records retracted between 1990-2015: 3137
Average number of matches per record for 1990-2015 1.1976410583359898
Max number of matches per record for 1990-2015 3
Records with exactly 1 match for 1990-2015 2596
Records with more than 1 match for 1990-2015 541


In [281]:
# Let us now define a list of papers to keep
# These will be those that are 
# a) in df_exactyear and have retraction doi same as original paper doi
# b) in df_exactyear and do not have "Retraction Note" or "Retraction Notice" in their title
# c) in df_exactyear and have publication year same as retraction year

records_same_doi = df_rw[df_rw['OriginalPaperDOI'].eq(df_rw['RetractionDOI']) & 
                         ~df_rw['RetractionDOI'].isin(['unavailable','Unavailable'])]['Record ID'].unique()

df_exactyear2 = df_exactyear[df_exactyear['Record ID'].isin(records_same_doi)]

extract_record_summary(df_exactyear2)


Number of unique records: 472
Average number of matches per record 1.1398305084745763
Max number of matches per record 3
Records with exactly 1 match 412
Records with more than 1 match 60
###########
Number of unique records retracted between 1990-2015: 472
Average number of matches per record for 1990-2015 1.1398305084745763
Max number of matches per record for 1990-2015 3
Records with exactly 1 match for 1990-2015 412
Records with more than 1 match for 1990-2015 60


In [294]:
# Adding records that have at least one match that is not a retraction notice

# Reading the retraction notices so that matches to them can be identified
df_retraction_notices = pd.read_csv(RETRACTION_NOTICES_LOCAL_PATH)

records_not_in_notices = df_exactyear[~df_exactyear['MAGPID'].isin(df_retraction_notices['PID'])]['Record ID'].unique()

df_exactyear22 = df_exactyear[df_exactyear['Record ID'].isin(records_same_doi) | 
                             df_exactyear['Record ID'].isin(records_not_in_notices)]

extract_record_summary(df_exactyear22)

Number of unique records: 3029
Average number of matches per record 1.2040277319247277
Max number of matches per record 3
Records with exactly 1 match 2490
Records with more than 1 match 539
###########
Number of unique records retracted between 1990-2015: 3029
Average number of matches per record for 1990-2015 1.2040277319247277
Max number of matches per record for 1990-2015 3
Records with exactly 1 match for 1990-2015 2490
Records with more than 1 match for 1990-2015 539
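
Note that the cell above filters at the record level: a record is kept if at least one of its matches is not a retraction notice, and all of its rows, including any notice rows, stay in the table. The toy example below (made-up IDs) contrasts this with dropping the notice rows themselves.

```python
import pandas as pd

matches = pd.DataFrame({'Record ID': [1, 1, 2], 'MAGPID': [10, 11, 20]})
notice_pids = pd.Series([11, 20])  # MAG IDs assumed to be retraction notices

is_notice = matches['MAGPID'].isin(notice_pids)

# Record-level: keep every row of a record that has any non-notice match
records_ok = matches.loc[~is_notice, 'Record ID'].unique()
record_level = matches[matches['Record ID'].isin(records_ok)]

# Row-level: drop the notice rows themselves
row_level = matches[~is_notice]

print(record_level['MAGPID'].tolist())
print(row_level['MAGPID'].tolist())
```

In the toy data, the record-level filter keeps notice 11 because record 1 also has the non-notice match 10; the row-level filter drops it.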


In [295]:
# Adding records whose MAG publication year is the same as the retraction year
records_same_pubyear_ryear = df_exactyear[df_exactyear['MAGPubYear'].eq(df_exactyear['RetractionYear'])]\
                                ['Record ID'].unique()

df_exactyear3 = df_exactyear[df_exactyear['Record ID'].isin(records_same_doi) | 
                            df_exactyear['Record ID'].isin(records_same_pubyear_ryear) | 
                            df_exactyear['Record ID'].isin(records_not_in_notices)]

extract_record_summary(df_exactyear3)

Number of unique records: 3097
Average number of matches per record 1.2001937358734258
Max number of matches per record 3
Records with exactly 1 match 2556
Records with more than 1 match 541
###########
Number of unique records retracted between 1990-2015: 3097
Average number of matches per record for 1990-2015 1.2001937358734258
Max number of matches per record for 1990-2015 3
Records with exactly 1 match for 1990-2015 2556
Records with more than 1 match for 1990-2015 541


In [287]:
# removing those with retraction notice keywords

df_exactyear4 = df_exactyear3[~df_exactyear3['MAGTitle'].str.contains('Retraction Notice') & 
                             ~df_exactyear3['MAGTitle'].str.contains('Retraction Note')]

extract_record_summary(df_exactyear4)

Number of unique records: 1111
Average number of matches per record 1.2736273627362735
Max number of matches per record 3
Records with exactly 1 match 844
Records with more than 1 match 267
###########
Number of unique records retracted between 1990-2015: 1111
Average number of matches per record for 1990-2015 1.2736273627362735
Max number of matches per record for 1990-2015 3
Records with exactly 1 match for 1990-2015 844
Records with more than 1 match for 1990-2015 267
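
One caveat for keyword filters like the one above: `str.contains` is case-sensitive by default, while the `MAGTitle` values shown in this notebook are lower-cased, with notices phrased as "notice of retraction ..." or "retraction note to ...". A hedged sketch of a case-insensitive pattern follows. The pattern is an assumption about which phrasings should count as notices, not the filter this notebook applied, and the titles are shortened, made-up examples.

```python
import pandas as pd

titles = pd.Series([
    'retraction note to therapeutic effects of a drug',
    'notice of retraction a method of clustering',
    'retracted use of matrix tracking',
    None,
])

# Assumed notice phrasings; adjust to the data at hand
pattern = r'retraction not(?:e|ice)|notice of retraction'
is_notice = titles.str.contains(pattern, case=False, na=False, regex=True)
print(is_notice.tolist())
```

Titles that merely start with "retracted" are left alone by this pattern, since they usually denote the retracted paper itself rather than the notice.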


In [297]:
# Records dropped by the criteria above

df_exactyear_remaining = df_exactyear[~df_exactyear['Record ID'].isin(df_exactyear3['Record ID'])]

# Checking that no kept record was already matched on DOI (the result is empty)

df_exactyear3[df_exactyear3['Record ID'].isin(matched_doi['Record ID'])]

Unnamed: 0,MAGTitle,score,index,MAGPID,MAGTitle.1,MAGPubYear,Record ID,RWTitleNorm,Title,Journal,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDOI,OriginalPaperPubMedID,Reason,RetractionYear


## d. Processing fuzzy year fuzzy matching

In [272]:
# processing fuzzy year fuzzy matching
df_fuzzyyear = pd.concat(dfs_fuzzyyear)
df_fuzzyyear = df_fuzzyyear.merge(df_rw, on='Record ID')

# removing records that already have an exact-year fuzzy match
df_fuzzyyear = df_fuzzyyear[~df_fuzzyyear['Record ID'].isin(df_exactyear['Record ID'])]

# removing those records that were filtered
df_fuzzyyear = df_fuzzyyear[~df_fuzzyyear['Record ID'].isin(records_filtered)]

df_fuzzyyear.head(2)

Unnamed: 0,RWTitleNorm,MAGTitle,score,index,MAGPID,MAGTitle.1,MAGPubYear,Record ID,Title,Journal,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDOI,OriginalPaperPubMedID,Reason,RetractionYear
198,protective effects of bazedoxifene paired with conjugated estrogens on pancreatic beta-cell dysfunction,retraction protective effects of bazedoxifene paired with conjugated estrogens on pancreatic β cell dysfunction,92.636816,7060760,2179077735,retraction protective effects of bazedoxifene paired with conjugated estrogens on pancreatic β cell dysfunction,2016.0,17304,Protective effects of bazedoxifene paired with conjugated estrogens on pancreatic beta-cell dysfunction,Biological & Pharmaceutical Bulletin,Research Article;,2015-11-06,10.1248/bpb.b15-00585,26548420,10.1248/bpb.b15-00585,26548420,+Concerns/Issues About Authorship;+Conflict of Interest;,2015.0
208,overexpression of thaumatin gene confers enhanced resistance to alternariabrassicae and tolerance to salinity and drought in transgenic brassicajuncea (l.) czern,retracted article overexpression of thaumatin gene confers enhanced resistance to alternaria brassicae and tolerance to salinity and drought in transgenic brassica juncea l czern,93.215339,4739624,1925258414,retracted article overexpression of thaumatin gene confers enhanced resistance to alternaria brassicae and tolerance to salinity and drought in transgenic brassica juncea l czern,2016.0,8766,Overexpression of thaumatin gene confers enhanced resistance to Alternariabrassicae and tolerance to salinity and drought in transgenic Brassicajuncea (L.) Czern,"Plant Cell, Tissue and Organ Culture (PCTOC)",Research Article;,2015-08-20,10.1007/s11240-015-0846-8,0,10.1007/s11240-015-0846-8,0,+Lack of Approval from Author;,2015.0


In [273]:
# summarizing fuzzy year fuzzy matching
extract_record_summary(df_fuzzyyear)

Number of unique records: 542
Average number of matches per record 1.2250922509225093
Max number of matches per record 3
Records with exactly 1 match 433
Records with more than 1 match 109
###########
Number of unique records retracted between 1990-2015: 542
Average number of matches per record for 1990-2015 1.2250922509225093
Max number of matches per record for 1990-2015 3
Records with exactly 1 match for 1990-2015 433
Records with more than 1 match for 1990-2015 109


In [108]:
# Now let us read retraction notices and remove them from matches
df_retraction_notices = pd.read_csv(RETRACTION_NOTICES_LOCAL_PATH)
df_retraction_notices.head()

Unnamed: 0,PID,PaperTitle,OriginalTitle,PubYear,DocSubTypes
0,1591961241,notice of retraction a method of multi dimensi...,Notice of Retraction A method of multi-dimensi...,2010,Retraction Notice
1,1868656167,notice of retraction effect of asparagus polys...,Notice of Retraction Effect of Asparagus polys...,2010,Retraction Notice
2,2042025123,retraction note to therapeutic effects of metf...,Retraction Note to: Therapeutic effects of met...,2016,Retraction Notice
3,2050825760,retraction note to aging decreases rate of doc...,Retraction Note to: Aging decreases rate of do...,2013,Retraction Notice
4,2770316826,notice of retraction a study of the eye catchi...,Notice of Retraction A study of the eye-catchi...,2017,Retraction Notice


In [125]:
df_exactyear

Unnamed: 0,MAGTitle,score,index,MAGPID,MAGTitle.1,MAGPubYear,Record ID,RWTitleNorm,RetractionYear
0,vitamin d supplementation affects serum high s...,99.324324,4317227,2112332874,vitamin d supplementation affects serum high s...,2013.0,28372,vitamin d supplementation affects serum high-s...,2021.0
1,vitamin d supplementation affects serum high s...,95.000000,3035336,2188639929,vitamin d supplementation affects serum high s...,2013.0,28372,vitamin d supplementation affects serum high-s...,2021.0
2,effects of vitamin d supplementation on glucos...,98.952880,4626370,2159206233,effects of vitamin d supplementation on glucos...,2013.0,28312,effects of vitamin d supplementation on glucos...,2021.0
3,microrna 222 promotes tumorigenesis via target...,96.744186,8236619,2014433516,microrna 222 promotes tumorigenesis via target...,2013.0,27886,microrna-222 promotes tumorigenesis via targ...,2021.0
4,mir 26a and its target cks2 modulate cell grow...,100.000000,10356493,2013406919,mir 26a and its target cks2 modulate cell grow...,2013.0,27451,mir-26a and its target cks2 modulate cell grow...,2021.0
...,...,...,...,...,...,...,...,...,...
14266,notice of retraction extraction of the polysac...,95.000000,2436854,1570021854,notice of retraction extraction of the polysac...,2011.0,27479,extraction of the polysaccharides from dunalie...,2011.0
14267,notice of retraction fabrication of injectable...,95.000000,1093255,1547960423,notice of retraction fabrication of injectable...,2011.0,27480,fabrication of injectable plla/alginate hydrog...,2011.0
14268,notice of retraction factor analysis of nitrat...,95.000000,6423322,2275721726,notice of retraction factor analysis of nitrat...,2011.0,27481,factor analysis of nitrate contamination on gr...,2011.0
14269,notice of retraction factor analysis on enviro...,95.000000,7351381,1938967737,notice of retraction factor analysis on enviro...,2011.0,27482,factor analysis on environmental parameters in...,2011.0


In [127]:
# Now let us remove these notices from both dataframes
df_exactyear_filtered = df_exactyear[~df_exactyear['MAGPID'].isin(df_retraction_notices['PID'])]
extract_record_summary(df_exactyear_filtered)

Number of unique records: 7586
Average number of matches per record 1.1345900342736621
Max number of matches per record 3
Records with exactly 1 match 6673
Records with more than 1 match 913
###########
Number of unique records retracted between 1990-2015: 3486
Average number of matches per record for 1990-2015 1.1451520367183017
Max number of matches per record for 1990-2015 3
Records with exactly 1 match for 1990-2015 3044
Records with more than 1 match for 1990-2015 442


In [142]:
# Rough work

df_retraction_notices[df_retraction_notices['PID']==1511764858]

Unnamed: 0,PID,PaperTitle,OriginalTitle,PubYear,DocSubTypes
3918,1511764858,notice of retraction pump shaft strength calcu...,Notice of Retraction Pump shaft strength calcu...,2010,Retraction Notice


In [143]:
df_retraction_notices[df_retraction_notices['PID']==1492572588]

Unnamed: 0,PID,PaperTitle,OriginalTitle,PubYear,DocSubTypes
2681,1492572588,notice of retraction pump body strength numeri...,Notice of Retraction Pump body strength numeri...,2010,Retraction Notice


In [183]:

# Rough work

# Printing the records with matches that are retraction notices
pd.set_option('display.max_colwidth', None)

# Extracting the number of matches for those records that were in retraction notices -- and hence removed
df_match_counts = df_exactyear[~df_exactyear['Record ID'].isin(df_exactyear_filtered['Record ID'])]\
                        .groupby('Record ID')['MAGPID'].nunique().reset_index()

# Extracting those records that were in retraction notices -- and hence removed from filtered
df_exactyear_in_rn = df_exactyear[~df_exactyear['Record ID'].isin(df_exactyear_filtered['Record ID'])]

# Checking the ones that have only one match
# df_exactyear_in_rn[df_exactyear_in_rn['Record ID'].isin(df_match_counts[df_match_counts['MAGPID'].eq(1)]\
#                                                             ['Record ID'])]

# Reloading Retraction Watch to check how many records have the same retraction DOI and original paper DOI
# Note: this overwrites df_rw with a narrower set of columns
df_rw = pd.read_csv(RW_ORIGINAL_W_YEAR_LOCAL_PATH, usecols=['Record ID','RetractionDOI','OriginalPaperDOI',
                                                            'RetractionPubMedID', 'OriginalPaperPubMedID',
                                                           'Journal'])\
                        .drop_duplicates()

df_rw_sameDOI = df_rw[df_rw['OriginalPaperDOI'].eq(df_rw['RetractionDOI']) | 
                     df_rw['RetractionPubMedID'].eq(df_rw['OriginalPaperPubMedID'])]


df_exactyear_sameDOI = df_exactyear_in_rn[df_exactyear_in_rn['Record ID'].isin(df_rw_sameDOI['Record ID'])]


df_exactyear_stillNotMatched = df_exactyear_in_rn[~df_exactyear_in_rn['Record ID'].\
                                                      isin(df_exactyear_sameDOI['Record ID'])]



df_exactyear_stillNotMatched[df_exactyear_stillNotMatched['RetractionYear'] == \
                                 df_exactyear_stillNotMatched['MAGPubYear']]['Record ID'].nunique()


df_exactyear_stillNotMatched[df_exactyear_stillNotMatched['RetractionYear'].isin(range(1990,2016))]

df_exactyear_stillNotMatched.merge(df_rw, on='Record ID').groupby('Journal')['Record ID'].nunique()\
    .reset_index().sort_values(by='Record ID').tail(50)

Unnamed: 0,Journal,Record ID
65,2010 International Conference on Advances in Energy Engineering,9
14,2009 IEEE International Conference on Network Infrastructure and Digital Content,9
11,2009 Asia-Pacific Power and Energy Engineering Conference,9
71,"2010 International Conference on Computer, Mechatronics, Control and Electronic Engineering",10
31,2009 Pacific-Asia Conference on Knowledge Engineering and Software Engineering,10
77,2010 International Conference on Financial Theory and Engineering,11
6,2009 2nd International Conference on Future Information Technology and Management Engineering,11
74,2010 International Conference on Education and Management Technology,11
73,2010 International Conference on E-Health Networking Digital Ecosystems and Technologies (EDT),11
26,2009 International Conference on Machine Learning and Cybernetics,12


In [128]:
df_fuzzyyear_filtered = df_fuzzyyear[~df_fuzzyyear['MAGPID'].isin(df_retraction_notices['PID'])]
extract_record_summary(df_fuzzyyear_filtered)

Number of unique records: 1349
Average number of matches per record 1.1267605633802817
Max number of matches per record 3
Records with exactly 1 match 1201
Records with more than 1 match 148
###########
Number of unique records retracted between 1990-2015: 571
Average number of matches per record for 1990-2015 1.1523642732049038
Max number of matches per record for 1990-2015 3
Records with exactly 1 match for 1990-2015 491
Records with more than 1 match for 1990-2015 80


In [62]:
df_fuzzyyear.groupby('Record ID')['MAGPID'].nunique().describe()

count    1414.000000
mean        1.206506
std         0.457462
min         1.000000
25%         1.000000
50%         1.000000
75%         1.000000
max         3.000000
Name: MAGPID, dtype: float64