## Extract final retraction notices

In this notebook, we extract the retraction notices that must be excluded from all calculations.

We will do so by:

1. Loading the retraction notices that were identified in MAG (Microsoft Academic Graph)
2. Removing the notices that were matched to a record in RW (Retraction Watch), using our logic in 0e.process_paper_matching.ipynb
3. Saving the remaining notices so that we can:

    a. Remove these notices when identifying the attrition year of authors

    b. Remove these notices when calculating confounders (as the year of retraction was included)
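
The filtering in step 2 amounts to an anti-join on paper IDs. Below is a minimal sketch on toy data, not the notebook's actual pipeline: the helper name `drop_matched_notices` and all ID values are hypothetical, and only the column names `PID` and `MAGPID` are taken from this notebook.

```python
import pandas as pd


def drop_matched_notices(notices, matches, notice_col="PID", match_col="MAGPID"):
    """Return the rows of `notices` whose ID does not appear in `matches`.

    Hypothetical helper for illustration; the column names default to the
    ones used in this notebook.
    """
    matched_ids = matches[match_col].unique()
    return notices[~notices[notice_col].isin(matched_ids)]


# Toy data: the IDs below are made up.
toy_notices = pd.DataFrame(
    {"PID": [101, 102, 103, 104], "PubYear": [2010, 2013, 2016, 2017]}
)
toy_matches = pd.DataFrame({"MAGPID": [102, 104, 999]})

toy_remaining = drop_matched_notices(toy_notices, toy_matches)
print(toy_remaining["PID"].tolist())  # [101, 103]
```

IDs that appear only in the matches table (999 in the toy data) have no effect, which is why the number of removed notices can be much smaller than the number of matched IDs.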

In [1]:
# Importing relevant packages

import pandas as pd
import os
from config_reader import read_config

In [2]:
# Reading paths
paths = read_config()
RETRACTION_NOTICES_LOCAL_PATH = paths['RETRACTION_NOTICES_LOCAL_PATH']
PROCESSED_RW_MAG_FINAL_PAPER_MATCHES_LOCAL_PATH = paths['PROCESSED_RW_MAG_FINAL_PAPER_MATCHES_LOCAL_PATH']
OUTDIR = paths['PROCESSED_FOLDER_LOCAL']

In [16]:
df_notices = pd.read_csv(RETRACTION_NOTICES_LOCAL_PATH)
df_notices.head()

Unnamed: 0,PID,PaperTitle,OriginalTitle,PubYear,DocSubTypes
0,1591961241,notice of retraction a method of multi dimensi...,Notice of Retraction A method of multi-dimensi...,2010,Retraction Notice
1,1868656167,notice of retraction effect of asparagus polys...,Notice of Retraction Effect of Asparagus polys...,2010,Retraction Notice
2,2042025123,retraction note to therapeutic effects of metf...,Retraction Note to: Therapeutic effects of met...,2016,Retraction Notice
3,2050825760,retraction note to aging decreases rate of doc...,Retraction Note to: Aging decreases rate of do...,2013,Retraction Notice
4,2770316826,notice of retraction a study of the eye catchi...,Notice of Retraction A study of the eye-catchi...,2017,Retraction Notice


In [17]:
df_notices['PID'].nunique()

16115

In [18]:
magpids_rw_mag_matches = pd.read_csv(PROCESSED_RW_MAG_FINAL_PAPER_MATCHES_LOCAL_PATH)['MAGPID'].unique()
len(magpids_rw_mag_matches)

6187

In [20]:
# Drop the notices whose PID appears among the MAGPIDs matched to an RW record

df_notices_postfiltering = df_notices[~df_notices['PID'].isin(magpids_rw_mag_matches)]
df_notices_postfiltering['PID'].nunique()

15892
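
The counts above imply that 16115 - 15892 = 223 notice PIDs were among the matched MAGPIDs. A quick way to double-check such a filter is to compare the drop in unique IDs against the size of the overlap between the two ID sets. The sketch below does this on made-up IDs; the function name `summarize_filter` is hypothetical and not part of this notebook.

```python
def summarize_filter(notice_ids, matched_ids):
    """Compare unique ID counts before and after removing matched IDs.

    Hypothetical helper for illustration. It raises an AssertionError if
    the number of removed IDs does not equal the size of the overlap.
    """
    notice_set = set(notice_ids)
    matched_set = set(matched_ids)
    overlap = notice_set & matched_set
    remaining = notice_set - matched_set
    assert len(remaining) == len(notice_set) - len(overlap)
    return {
        "before": len(notice_set),
        "removed": len(overlap),
        "after": len(remaining),
    }


# Toy IDs, not real MAG paper IDs.
summary = summarize_filter([101, 102, 103, 104], [102, 104, 999])
print(summary)  # {'before': 4, 'removed': 2, 'after': 2}
```

In this notebook the same check could be run with `df_notices['PID']` and `magpids_rw_mag_matches` as the two arguments.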

In [22]:
# Saving

FILENAME = "retraction_notices_postfiltering"

# Build the full output file path
file_path = os.path.join(OUTDIR, f"{FILENAME}.csv")

# Writing DataFrame to CSV with error handling
try:
    df_notices_postfiltering.to_csv(file_path, index=False)
    print("File saved successfully")
except Exception as e:
    print(f"Error saving file: {e}")

File saved successfully
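
The saving cell above prints a message on failure and continues. If a failed write should stop the notebook instead, one option is to let the exception propagate and read the file back as a check. The following is only a sketch under that assumption: `save_and_verify` is a hypothetical helper, and the toy DataFrame and temporary directory stand in for `df_notices_postfiltering` and `OUTDIR`.

```python
import os
import tempfile

import pandas as pd


def save_and_verify(df, out_dir, filename):
    """Write `df` to `<out_dir>/<filename>.csv` and confirm the row count.

    Hypothetical helper for illustration. Errors from `to_csv` are not
    caught, so a failed write stops execution.
    """
    file_path = os.path.join(out_dir, f"{filename}.csv")
    df.to_csv(file_path, index=False)

    # Read the file back and compare the number of rows.
    reloaded = pd.read_csv(file_path)
    if len(reloaded) != len(df):
        raise ValueError(
            f"Row count mismatch: wrote {len(df)}, read back {len(reloaded)}"
        )
    return file_path


# Toy example using a temporary directory in place of OUTDIR.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_df = pd.DataFrame({"PID": [101, 103], "PubYear": [2010, 2016]})
    saved_path = save_and_verify(toy_df, tmp_dir, "toy_notices")
    saved_name = os.path.basename(saved_path)
    reloaded_rows = len(pd.read_csv(saved_path))

print(saved_name, reloaded_rows)
```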
