# Data Dropping & Augmentation: Creating Matching Analytical Samples

In this notebook, we add the columns our dataframe needs before we conduct the matching analysis.

**First, we drop the unnecessary columns. This step was added after Bedoor and Kinga asked for closest matching with a distance measure (shortly before submission to PNAS).**


We create the matching analytical sample for paper, citation, and collaborator distances of (i) 10%, (ii) 20%, and (iii) 30%, using the following steps:

1. First, we load the three matching files for 10%, 20%, and 30%.
2. Then, we load the confounders that were matched for treatment and control using the file: **RWMatched_intersection_wPapersCitationsCollaborators_wCollabYear_closestMatch30.csv**
3. Then, we load the stratification variables (reason for retraction, time of retraction, order of the author in the retracted paper, type of retraction, and author academic age), map author affiliation rank, and impute retractor majority to avoid NaNs, using the file: **filtered_sample.csv**
4. Finally, we compute the outcome variables using the files: **RW_MAGcollaborators_1stDegree_rematching_woPapersCitationsCollaborators_wCollabYear_le2020_closestMatch30.csv** and **To be filled**. The outcome variables are:
    1. Number of collaborators retained by authors and their matches
    2. Number of collaborators gained by authors and their matches
    3. Number of triads closed by authors and their matches
    4. Proportion of triads closed by authors and their matches (Newman's Coefficient or NC)
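
The first two outcome variables can be illustrated with plain set operations. The sketch below is only an illustration: it assumes that "retained" means collaborators who appear both before retraction and in the post-retraction window, and that "gained" means collaborators who appear only in the window. These definitions, the helper name `count_retained_and_gained`, and the toy IDs are assumptions, not the notebook's confirmed implementation.

```python
def count_retained_and_gained(pre, post):
    """Return (n_retained, n_gained) for two sets of collaborator IDs.

    Assumed definitions (not confirmed by the notebook):
      retained = collaborators present both before and after retraction
      gained   = collaborators present only after retraction
    """
    retained = pre & post   # set intersection
    gained = post - pre     # set difference
    return len(retained), len(gained)


# Toy collaborator IDs, for illustration only
pre_collaborators = {101, 102, 103}
post_collaborators = {102, 103, 104, 105}

n_retained, n_gained = count_retained_and_gained(pre_collaborators, post_collaborators)
print(n_retained, n_gained)
```

Working with sets keeps the counts insensitive to repeated collaborations with the same person, which is presumably why the notebook later stores collaborators as sets.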


In [1]:
import pandas as pd
import numpy as np

In [2]:
# Step 1:  load the three matching files for 10%, 20% and 30%.

relevant_cols = ['MAGAID', 'MatchMAGAID', 'Record ID',]

indir = "/Users/sm9654/desktop/NYUAD/nyuad-research/retraction_openalex/retraction_effects_on_academic_careers/data/processed/author_matching/"

dfmatched_30perc = pd.read_csv(indir+"/closestAverageMatch_tolerance_0.3_w_0.8.csv",
                              usecols=relevant_cols).drop_duplicates()

dfmatched_20perc = pd.read_csv(indir+"/closestAverageMatch_tolerance_0.2_w_0.8.csv",
                              usecols=relevant_cols).drop_duplicates()

dfmatched_10perc = pd.read_csv(indir+"/closestAverageMatch_tolerance_0.1_w_0.8.csv",
                              usecols=relevant_cols).drop_duplicates()


In [3]:
dfmatched_30perc[dfmatched_30perc.MatchMAGAID.duplicated()]

Unnamed: 0,MAGAID,MatchMAGAID,Record ID
1323,2115123316,2.144725e+09,4442.0
1415,2117168174,2.101954e+09,8450.0
1416,2117168174,2.304848e+09,8450.0
1745,2136126495,2.048001e+09,2132.0
1811,2142423705,2.041562e+09,4911.0
...,...,...,...
3727,2765273410,2.141240e+09,16929.0
3728,2765273410,2.146519e+09,16929.0
3729,2765273410,2.557500e+09,16929.0
3756,2791620886,2.715645e+09,5429.0


In [4]:
dfmatched_30perc['MAGAID'].nunique()

2348

In [5]:
dfmatched_10perc.columns

Index(['MAGAID', 'MatchMAGAID', 'Record ID'], dtype='object')

In [6]:
dfmatched_10perc.shape, dfmatched_20perc.shape, dfmatched_30perc.shape

((2416, 3), (3397, 3), (4054, 3))

In [7]:
dfmatched_10perc.MAGAID.nunique(), dfmatched_20perc.MAGAID.nunique(), dfmatched_30perc.MAGAID.nunique()

(751, 1700, 2348)

In [8]:
# Step 2: load the confounders that were matched for treatment and control

df_confounders = pd.read_csv(indir+"RWMatched_intersection_wPapersCitationsCollaboratorsAtRetraction_wCollabYear_wActivityPostRetraction.csv")

df_confounders.columns


Index(['MAGAID', 'MatchMAGAID', 'Record ID', 'MAGPID', 'RetractionYear',
       'MAGRetractionYearAffID', 'MAGRetractionYearAffRank',
       'MAGRetractionYearAffYear', 'MatchMAGRetractionYearAffID',
       'MatchMAGRetractionYearAffYear', 'MatchMAGRetractionYearAffRank',
       'MatchMAGMaxRetractionYear', 'MAGrootFID', 'MAGrootFIDMaxPercent',
       'MatchMAGrootFID', 'MatchMAGrootFIDMaxPercent', 'GenderizeGender',
       'MAGFirstPubYear', 'MAGFirstAffID', 'MAGFirstAffiliationRank',
       'MatchMAGFirstAffID', 'MatchMAGFirstAffYear',
       'MatchMAGFirstAffiliationRank', 'MAGCumPapersAtRetraction',
       'MAGCumPapersYearAtRetraction', 'MatchMAGCumPapersYearAtRetraction',
       'MatchMAGCumPapersAtRetraction', 'MAGCumCitationsAtRetraction',
       'MAGCumCitationsYearAtRetraction',
       'MatchMAGCumCitationsYearAtRetraction',
       'MatchMAGCumCitationsAtRetraction', 'MAGCumCollaboratorsAtRetraction',
       'MAGCumCollaboratorsYearAtRetraction',
       'MatchMAGCumCollaborat

In [9]:
def merge_confounders(dfi):
    # Inner merge on the pair keys; the asserts guard against silently losing authors or matches
    dfi_w_confounders = dfi.merge(df_confounders, on=['MAGAID','MatchMAGAID','Record ID'])
    assert dfi_w_confounders.MAGAID.nunique() == dfi.MAGAID.nunique()
    assert dfi_w_confounders.MatchMAGAID.nunique() == dfi.MatchMAGAID.nunique()
    return dfi_w_confounders.drop_duplicates()


dfmatched_10_perc_wConfounders = merge_confounders(dfmatched_10perc)
dfmatched_20_perc_wConfounders = merge_confounders(dfmatched_20perc)
dfmatched_30_perc_wConfounders = merge_confounders(dfmatched_30perc)
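
When one of the asserts in `merge_confounders` fails, it helps to see which pairs were lost. The sketch below shows one possible way to do that on toy data, using the `how`, `indicator`, and `validate` parameters of `DataFrame.merge`. The frames `pairs` and `confounders` are hypothetical stand-ins, not the notebook's data.

```python
import pandas as pd

# Toy stand-ins for a matched-pairs table and a confounders table
pairs = pd.DataFrame({"MAGAID": [1, 1, 2, 3],
                      "MatchMAGAID": [10, 11, 20, 30],
                      "Record ID": [100, 100, 200, 300]})
confounders = pd.DataFrame({"MAGAID": [1, 1, 2],
                            "MatchMAGAID": [10, 11, 20],
                            "Record ID": [100, 100, 200],
                            "RetractionYear": [2008, 2008, 2012]})

keys = ["MAGAID", "MatchMAGAID", "Record ID"]

# how="left" keeps every pair; indicator=True adds a "_merge" column that
# records whether the pair was found in the confounders table.
# validate="one_to_one" raises MergeError if keys repeat on either side.
merged = pairs.merge(confounders, on=keys, how="left",
                     indicator=True, validate="one_to_one")

# Pairs that have no confounder row
unmatched_pairs = merged[merged["_merge"] == "left_only"]
print(unmatched_pairs[keys])
```

On real data with repeated keys, the `validate` check would raise rather than silently multiply rows, so it may need to be relaxed or dropped depending on what the files contain.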



In [10]:
dfmatched_30_perc_wConfounders

Unnamed: 0,MAGAID,MatchMAGAID,Record ID,MAGPID,RetractionYear,MAGRetractionYearAffID,MAGRetractionYearAffRank,MAGRetractionYearAffYear,MatchMAGRetractionYearAffID,MatchMAGRetractionYearAffYear,...,MatchMAGCumPapersAtRetraction,MAGCumCitationsAtRetraction,MAGCumCitationsYearAtRetraction,MatchMAGCumCitationsYearAtRetraction,MatchMAGCumCitationsAtRetraction,MAGCumCollaboratorsAtRetraction,MAGCumCollaboratorsYearAtRetraction,MatchMAGCumCollaboratorsYearAtRetraction,MatchMAGCumCollaboratorsAtRetraction,MAGRetractionYearAffRankOrdinal
0,2184860,2.628313e+09,15835.0,2.609888e+09,2008.0,861853513.0,151-200,2008.0,881766915.0,2008.0,...,7.0,6.0,2008.0,2008.0,6.0,22.0,2008.0,2008.0,23.0,175.0
1,2184860,2.136872e+09,15835.0,2.609888e+09,2008.0,861853513.0,151-200,2008.0,205349734.0,2008.0,...,7.0,6.0,2008.0,2008.0,6.0,22.0,2008.0,2008.0,21.0,175.0
2,8197726,1.574644e+09,3444.0,1.506358e+08,2012.0,13955877.0,401-500,2012.0,185443292.0,2011.0,...,3.0,18.0,2012.0,2012.0,18.0,16.0,2012.0,2011.0,15.0,450.0
3,8197726,1.574644e+09,3444.0,1.506358e+08,2012.0,13955877.0,401-500,2012.0,185443292.0,2011.0,...,3.0,18.0,2012.0,2012.0,18.0,16.0,2012.0,2011.0,15.0,450.0
4,9474215,2.169122e+09,7285.0,1.985944e+09,2015.0,79576946.0,17,2003.0,79576946.0,1999.0,...,60.0,6233.0,2015.0,2015.0,5692.0,284.0,2015.0,2015.0,207.0,17.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5413,3174447547,1.897294e+09,23452.0,2.166758e+09,2008.0,99065089.0,43,2008.0,86519309.0,2007.0,...,4.0,1.0,2008.0,2006.0,1.0,6.0,2008.0,2007.0,6.0,43.0
5414,3174844467,2.315520e+09,17239.0,2.102017e+09,2014.0,865915315.0,101-150,2013.0,887064364.0,2014.0,...,23.0,14.0,2014.0,2014.0,18.0,11.0,2013.0,2014.0,10.0,125.0
5415,3175435814,2.137476e+09,18203.0,2.519517e+09,2015.0,159247623.0,801-900,2015.0,36243813.0,2009.0,...,9.0,15.0,2015.0,2015.0,13.0,28.0,2015.0,2011.0,27.0,850.0
5416,3176125681,2.168565e+09,4333.0,1.789963e+09,2004.0,125602781.0,601-700,2000.0,317356780.0,2001.0,...,8.0,201.0,2004.0,2004.0,197.0,31.0,2004.0,2001.0,29.0,650.0


In [11]:
# Loading regression sample

df_regression_sample = pd.read_csv(indir+"/old_RW_Authors_forRegression_rematching.csv",
                                  usecols=['MAGAID','Record ID',
                                          'RetractorMajority']).drop_duplicates()


indir2 = "/Users/sm9654/desktop/NYUAD/nyuad-research/retraction_openalex/retraction_effects_on_academic_careers/data/processed/"

df_filtered_sample = pd.read_csv(indir2+"RW_authors_w_confounders_filteredSample_postNHB_BedoorsCorrections_Augmented.csv",
                                usecols=['MAGAID', 'Record ID', 'MAGAIDRankTypeInRetractedPaper',
                                        'ReasonPropagatedMajorityOfMajority']).drop_duplicates()

df_regression_sample = df_filtered_sample.merge(df_regression_sample, on=['MAGAID','Record ID'], how='left')\
                                            .rename(columns={'DemiDecade':'DemiDecadeOfRetraction',
                                                   'MAGAIDRankTypeInRetractedPaper':'MAGAIDFirstORLastAuthorFlag'})\
                                            .replace({'First or Last or Only Author':'MAGFirstOrLastAuthor',
                                                'Middle Author':'MAGMiddleAuthor'})
df_regression_sample

Unnamed: 0,MAGAID,Record ID,ReasonPropagatedMajorityOfMajority,MAGAIDFirstORLastAuthorFlag,RetractorMajority
0,2127983451,2343,mistake,MAGFirstOrLastAuthor,
1,1986180616,3294,misconduct,MAGFirstOrLastAuthor,
2,2134970185,3489,mistake,MAGMiddleAuthor,
3,2600580187,3631,mistake,MAGFirstOrLastAuthor,author
4,257122240,2202,misconduct,MAGFirstOrLastAuthor,author
...,...,...,...,...,...
18073,384584067,16484,mistake,MAGFirstOrLastAuthor,
18074,2776957463,1994,other,MAGFirstOrLastAuthor,journal
18075,2115767897,1994,other,MAGMiddleAuthor,journal
18076,582542066,6557,other,MAGMiddleAuthor,journal


In [12]:
df_regression_sample[~df_regression_sample['RetractorMajority'].isna()]['MAGAID'].nunique()

4981

In [13]:
def merge_strataVars(dfi):
    dfi_w_strataVars = dfi.merge(df_regression_sample,on=['MAGAID','Record ID'])
    assert(dfi_w_strataVars.MAGAID.nunique() == dfi.MAGAID.nunique())
    return dfi_w_strataVars.drop_duplicates()

dfmatched_10_perc_wStrataVars = merge_strataVars(dfmatched_10_perc_wConfounders)
dfmatched_20_perc_wStrataVars = merge_strataVars(dfmatched_20_perc_wConfounders)
dfmatched_30_perc_wStrataVars = merge_strataVars(dfmatched_30_perc_wConfounders)

dfmatched_30_perc_wStrataVars

Unnamed: 0,MAGAID,MatchMAGAID,Record ID,MAGPID,RetractionYear,MAGRetractionYearAffID,MAGRetractionYearAffRank,MAGRetractionYearAffYear,MatchMAGRetractionYearAffID,MatchMAGRetractionYearAffYear,...,MatchMAGCumCitationsYearAtRetraction,MatchMAGCumCitationsAtRetraction,MAGCumCollaboratorsAtRetraction,MAGCumCollaboratorsYearAtRetraction,MatchMAGCumCollaboratorsYearAtRetraction,MatchMAGCumCollaboratorsAtRetraction,MAGRetractionYearAffRankOrdinal,ReasonPropagatedMajorityOfMajority,MAGAIDFirstORLastAuthorFlag,RetractorMajority
0,2184860,2.628313e+09,15835.0,2.609888e+09,2008.0,861853513.0,151-200,2008.0,881766915.0,2008.0,...,2008.0,6.0,22.0,2008.0,2008.0,23.0,175.0,other,MAGMiddleAuthor,
1,2184860,2.136872e+09,15835.0,2.609888e+09,2008.0,861853513.0,151-200,2008.0,205349734.0,2008.0,...,2008.0,6.0,22.0,2008.0,2008.0,21.0,175.0,other,MAGMiddleAuthor,
2,8197726,1.574644e+09,3444.0,1.506358e+08,2012.0,13955877.0,401-500,2012.0,185443292.0,2011.0,...,2012.0,18.0,16.0,2012.0,2011.0,15.0,450.0,mistake,MAGMiddleAuthor,author
3,8197726,1.574644e+09,3444.0,1.506358e+08,2012.0,13955877.0,401-500,2012.0,185443292.0,2011.0,...,2012.0,18.0,16.0,2012.0,2011.0,15.0,450.0,mistake,MAGMiddleAuthor,author
4,9474215,2.169122e+09,7285.0,1.985944e+09,2015.0,79576946.0,17,2003.0,79576946.0,1999.0,...,2015.0,5692.0,284.0,2015.0,2015.0,207.0,17.0,mistake,MAGMiddleAuthor,author
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5413,3174447547,1.897294e+09,23452.0,2.166758e+09,2008.0,99065089.0,43,2008.0,86519309.0,2007.0,...,2006.0,1.0,6.0,2008.0,2007.0,6.0,43.0,other,MAGFirstOrLastAuthor,other
5414,3174844467,2.315520e+09,17239.0,2.102017e+09,2014.0,865915315.0,101-150,2013.0,887064364.0,2014.0,...,2014.0,18.0,11.0,2013.0,2014.0,10.0,125.0,misconduct,MAGMiddleAuthor,
5415,3175435814,2.137476e+09,18203.0,2.519517e+09,2015.0,159247623.0,801-900,2015.0,36243813.0,2009.0,...,2015.0,13.0,28.0,2015.0,2011.0,27.0,850.0,plagiarism,MAGFirstOrLastAuthor,
5416,3176125681,2.168565e+09,4333.0,1.789963e+09,2004.0,125602781.0,601-700,2000.0,317356780.0,2001.0,...,2004.0,197.0,31.0,2004.0,2001.0,29.0,650.0,mistake,MAGMiddleAuthor,


In [14]:
dfmatched_30_perc_wStrataVars.ReasonPropagatedMajorityOfMajority.value_counts()

ReasonPropagatedMajorityOfMajority
misconduct    1776
plagiarism    1614
mistake       1180
other          848
Name: count, dtype: int64

### Academic age

In [15]:
def compute_activityBin(row):
    if(row.AcademicAgeBeforeRetraction <= 1):
        return "1"
    elif(row.AcademicAgeBeforeRetraction <= 2):
        return "2"
    elif(row.AcademicAgeBeforeRetraction <= 5):
        return "3-5"
    else:
        return ">5"

def augment_age(dfi):
    dfi['AcademicAgeBeforeRetraction'] = dfi['RetractionYear'] - dfi['MAGFirstPubYear']
    dfi['AcademicAgeBin'] = dfi.apply(lambda row: compute_activityBin(row), axis=1)
    return dfi

dfmatched_10_perc_wStrataVars = augment_age(dfmatched_10_perc_wStrataVars)
dfmatched_20_perc_wStrataVars = augment_age(dfmatched_20_perc_wStrataVars)
dfmatched_30_perc_wStrataVars = augment_age(dfmatched_30_perc_wStrataVars)


dfmatched_10_perc_wStrataVars

Unnamed: 0,MAGAID,MatchMAGAID,Record ID,MAGPID,RetractionYear,MAGRetractionYearAffID,MAGRetractionYearAffRank,MAGRetractionYearAffYear,MatchMAGRetractionYearAffID,MatchMAGRetractionYearAffYear,...,MAGCumCollaboratorsAtRetraction,MAGCumCollaboratorsYearAtRetraction,MatchMAGCumCollaboratorsYearAtRetraction,MatchMAGCumCollaboratorsAtRetraction,MAGRetractionYearAffRankOrdinal,ReasonPropagatedMajorityOfMajority,MAGAIDFirstORLastAuthorFlag,RetractorMajority,AcademicAgeBeforeRetraction,AcademicAgeBin
0,2184860,2.628313e+09,15835.0,2.609888e+09,2008.0,861853513.0,151-200,2008.0,881766915.0,2008.0,...,22.0,2008.0,2008.0,23.0,175.0,other,MAGMiddleAuthor,,2.0,2
1,2184860,2.136872e+09,15835.0,2.609888e+09,2008.0,861853513.0,151-200,2008.0,205349734.0,2008.0,...,22.0,2008.0,2008.0,21.0,175.0,other,MAGMiddleAuthor,,2.0,2
2,8197726,1.574644e+09,3444.0,1.506358e+08,2012.0,13955877.0,401-500,2012.0,185443292.0,2011.0,...,16.0,2012.0,2011.0,15.0,450.0,mistake,MAGMiddleAuthor,author,5.0,3-5
3,8197726,1.574644e+09,3444.0,1.506358e+08,2012.0,13955877.0,401-500,2012.0,185443292.0,2011.0,...,16.0,2012.0,2011.0,15.0,450.0,mistake,MAGMiddleAuthor,author,5.0,3-5
4,47570122,2.063571e+09,1202.0,1.975776e+09,2015.0,126744593.0,201-300,2015.0,100532134.0,2015.0,...,39.0,2015.0,2015.0,38.0,250.0,misconduct,MAGMiddleAuthor,author,5.0,3-5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3617,3169533057,2.413698e+09,17031.0,1.140967e+08,2013.0,68947357.0,101-150,2013.0,24943067.0,2013.0,...,13.0,2013.0,2013.0,14.0,125.0,misconduct,MAGFirstOrLastAuthor,,4.0,3-5
3618,3173543754,1.840802e+09,9548.0,2.045218e+09,2011.0,193775966.0,201-300,2011.0,45084792.0,2010.0,...,38.0,2011.0,2010.0,35.0,250.0,misconduct,MAGFirstOrLastAuthor,journal,6.0,>5
3619,3174447547,1.897294e+09,23452.0,2.166758e+09,2008.0,99065089.0,43,2008.0,86519309.0,2007.0,...,6.0,2008.0,2007.0,6.0,43.0,other,MAGFirstOrLastAuthor,other,2.0,2
3620,3176125681,2.168565e+09,4333.0,1.789963e+09,2004.0,125602781.0,601-700,2000.0,317356780.0,2001.0,...,31.0,2004.0,2001.0,29.0,650.0,mistake,MAGMiddleAuthor,,8.0,>5
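
The row-wise binning in `compute_activityBin` can also be written in vectorized form. The sketch below uses `pd.cut` on a toy series with the same cut points (≤1, ≤2, ≤5, >5). It is an alternative illustration, not the notebook's method, and it differs in one respect: a missing age stays missing, whereas the row-wise function returns ">5" for it because every comparison with NaN is false.

```python
import numpy as np
import pandas as pd

# Toy academic ages, for illustration only
ages = pd.Series([0, 1, 2, 3, 5, 6, 22], name="AcademicAgeBeforeRetraction")

# Right-closed intervals: (-inf, 1], (1, 2], (2, 5], (5, inf]
age_bins = pd.cut(ages,
                  bins=[-np.inf, 1, 2, 5, np.inf],
                  labels=["1", "2", "3-5", ">5"],
                  right=True)
print(age_bins.tolist())
```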


## Augmenting Field Name

In [19]:
df_fieldnames = pd.read_csv(indir+"RootFieldsNames.txt").\
                    rename(columns={'root_FieldID':'MAGrootFID',
                                   'FieldName': 'MAGFieldName'})

dfmatched_10_perc_wStrataVars = dfmatched_10_perc_wStrataVars.merge(df_fieldnames, 
                                                                   on='MAGrootFID')

dfmatched_20_perc_wStrataVars = dfmatched_20_perc_wStrataVars.merge(df_fieldnames, 
                                                                   on='MAGrootFID')

dfmatched_30_perc_wStrataVars = dfmatched_30_perc_wStrataVars.merge(df_fieldnames, 
                                                                   on='MAGrootFID')

## Mapping Field Names to STEM and non-STEM

In [20]:
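# Classifying fields with < 5% as other STEM and non-STEM
# Note: 'psychology' appears in both lists; because the 'other STEM fields'
# replacement below runs first, psychology ends up labelled 'other STEM fields'.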
other_stem_fields = ['materials science', 'computer science',
                'engineering', 'mathematics', 'psychology',
                'economics', 'environmental science']

non_stem_fields = ['psychology','political science', 'geology',
                  'philosophy','geography','sociology','business',
                  'history','art']

dfmatched_10_perc_wStrataVars['MAGFieldName'] = dfmatched_10_perc_wStrataVars['MAGFieldName']\
                                                   .replace(dict.fromkeys(other_stem_fields, 'other STEM fields'))\
                                                    .replace(dict.fromkeys(non_stem_fields,'non-STEM fields'))

dfmatched_20_perc_wStrataVars['MAGFieldName'] = dfmatched_20_perc_wStrataVars['MAGFieldName']\
                                                   .replace(dict.fromkeys(other_stem_fields, 'other STEM fields'))\
                                                    .replace(dict.fromkeys(non_stem_fields,'non-STEM fields'))

dfmatched_30_perc_wStrataVars['MAGFieldName'] = dfmatched_30_perc_wStrataVars['MAGFieldName']\
                                                   .replace(dict.fromkeys(other_stem_fields, 'other STEM fields'))\
                                                    .replace(dict.fromkeys(non_stem_fields,'non-STEM fields'))
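
The order of the two chained `.replace` calls matters when a field name appears in both lists. The toy sketch below illustrates that behaviour with shortened, hypothetical lists (`other_stem` and `non_stem` are not the notebook's full lists).

```python
import pandas as pd

other_stem = ["computer science", "psychology"]   # toy subset
non_stem = ["psychology", "history"]              # toy subset

fields = pd.Series(["biology", "psychology", "history", "computer science"])

# The first replace runs before the second, so a label that appears in
# both lists is consumed by the first mapping and never reaches the second.
chained = (fields
           .replace(dict.fromkeys(other_stem, "other STEM fields"))
           .replace(dict.fromkeys(non_stem, "non-STEM fields")))
print(chained.tolist())
```

If each field should belong to exactly one group, keeping the two lists disjoint removes the dependence on call order.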


## Mapping Retractor Majority NaNs to Other

In [21]:
def impute_retractor_majority_NaNs(dfj):
    dfj['RetractorMajority'] = dfj['RetractorMajority'].fillna('other retractor')
    dfj['RetractorMajority'] = dfj['RetractorMajority'].replace({'other':'other retractor'})
    return dfj
    
    
dfmatched_10_perc_wStrataVars = impute_retractor_majority_NaNs(dfmatched_10_perc_wStrataVars)
dfmatched_20_perc_wStrataVars = impute_retractor_majority_NaNs(dfmatched_20_perc_wStrataVars)
dfmatched_30_perc_wStrataVars = impute_retractor_majority_NaNs(dfmatched_30_perc_wStrataVars)

## Mapping Affiliation Ranks

In [22]:
def map_affiliation_ranks(dfj, col):
    
    mapping = {'101-150':'101-500',
              '151-200':'101-500',
              '201-300':'101-500',
              '301-400':'101-500',
              '401-500':'101-500',
              '501-600':'501-1000',
              '601-700':'501-1000',
              '701-800':'501-1000',
              '801-900':'501-1000',
              '901-1000':'501-1000',
              '1001-':'>1000',}
    
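    # Ranks not in the mapping (plain numeric ranks such as '17') fall into '1-100';
    # note that missing ranks (NaN) are also labelled '1-100' by this fillna
    dfj[col+'Stratified'] = dfj[col].map(mapping).fillna('1-100')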
    
    return dfj


dfmatched_10_perc_wStrataVars = map_affiliation_ranks(dfmatched_10_perc_wStrataVars, 'MAGRetractionYearAffRank')
dfmatched_20_perc_wStrataVars = map_affiliation_ranks(dfmatched_20_perc_wStrataVars, 'MAGRetractionYearAffRank')
dfmatched_30_perc_wStrataVars = map_affiliation_ranks(dfmatched_30_perc_wStrataVars, 'MAGRetractionYearAffRank')
dfmatched_30_perc_wStrataVars

Unnamed: 0,MAGAID,MatchMAGAID,Record ID,MAGPID,RetractionYear,MAGRetractionYearAffID,MAGRetractionYearAffRank,MAGRetractionYearAffYear,MatchMAGRetractionYearAffID,MatchMAGRetractionYearAffYear,...,MatchMAGCumCollaboratorsYearAtRetraction,MatchMAGCumCollaboratorsAtRetraction,MAGRetractionYearAffRankOrdinal,ReasonPropagatedMajorityOfMajority,MAGAIDFirstORLastAuthorFlag,RetractorMajority,AcademicAgeBeforeRetraction,AcademicAgeBin,MAGFieldName,MAGRetractionYearAffRankStratified
0,2184860,2.628313e+09,15835.0,2.609888e+09,2008.0,861853513.0,151-200,2008.0,881766915.0,2008.0,...,2008.0,23.0,175.0,other,MAGMiddleAuthor,other retractor,2.0,2,biology,101-500
1,2184860,2.136872e+09,15835.0,2.609888e+09,2008.0,861853513.0,151-200,2008.0,205349734.0,2008.0,...,2008.0,21.0,175.0,other,MAGMiddleAuthor,other retractor,2.0,2,biology,101-500
2,8197726,1.574644e+09,3444.0,1.506358e+08,2012.0,13955877.0,401-500,2012.0,185443292.0,2011.0,...,2011.0,15.0,450.0,mistake,MAGMiddleAuthor,author,5.0,3-5,biology,101-500
3,9474215,2.169122e+09,7285.0,1.985944e+09,2015.0,79576946.0,17,2003.0,79576946.0,1999.0,...,2015.0,207.0,17.0,mistake,MAGMiddleAuthor,author,22.0,>5,biology,1-100
4,13737004,2.311431e+09,3344.0,2.035632e+09,2014.0,186903577.0,701-800,2013.0,22248866.0,2011.0,...,2011.0,24.0,750.0,mistake,MAGMiddleAuthor,other retractor,10.0,>5,biology,501-1000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5413,2482103685,2.161710e+09,18635.0,2.081956e+09,2012.0,191208505.0,201-300,2012.0,8961855.0,2012.0,...,2012.0,10.0,250.0,other,MAGMiddleAuthor,other retractor,1.0,1,non-STEM fields,101-500
5414,2482103685,2.483120e+09,18635.0,2.081956e+09,2012.0,191208505.0,201-300,2012.0,75027704.0,2012.0,...,2012.0,10.0,250.0,other,MAGMiddleAuthor,other retractor,1.0,1,non-STEM fields,101-500
5415,2501052294,2.149938e+09,18635.0,2.081956e+09,2012.0,191208505.0,201-300,2012.0,162868743.0,2011.0,...,2012.0,10.0,250.0,other,MAGMiddleAuthor,other retractor,1.0,1,non-STEM fields,101-500
5416,2714015866,2.310975e+09,7423.0,2.147214e+09,2008.0,129774422.0,151-200,2008.0,108290504.0,2008.0,...,2008.0,4.0,175.0,plagiarism,MAGMiddleAuthor,other retractor,0.0,1,non-STEM fields,101-500


In [23]:
dfmatched_30_perc_wStrataVars[['MAGRetractionYearAffRank','MAGRetractionYearAffRankStratified']].head(30)

Unnamed: 0,MAGRetractionYearAffRank,MAGRetractionYearAffRankStratified
0,151-200,101-500
1,151-200,101-500
2,401-500,101-500
3,17,1-100
4,701-800,501-1000
5,151-200,101-500
6,201-300,101-500
7,301-400,101-500
8,301-400,101-500
9,601-700,501-1000
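
Because `map_affiliation_ranks` relies on `fillna('1-100')`, any rank that is missing in the source column is also placed in the top stratum. The sketch below shows this on toy data, next to a hypothetical alternative that labels only rows that actually carry a rank. Whether missing ranks occur in the real data is not something this sketch can tell.

```python
import numpy as np
import pandas as pd

ranks = pd.Series(["17", "151-200", "701-800", np.nan])
mapping = {"151-200": "101-500", "701-800": "501-1000"}  # toy subset

# Behaviour as in the notebook: everything unmapped, including NaN, becomes '1-100'
as_in_notebook = ranks.map(mapping).fillna("1-100")

# Hypothetical alternative: assign '1-100' only where a rank exists but is unmapped
stratified = ranks.map(mapping)
has_rank_but_unmapped = stratified.isna() & ranks.notna()
stratified = stratified.mask(has_rank_but_unmapped, "1-100")

print(as_in_notebook.tolist())
print(stratified.tolist())
```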


## Augmenting outcome variables

In [24]:
dfmatched_10_perc_wStrataVars.columns

Index(['MAGAID', 'MatchMAGAID', 'Record ID', 'MAGPID', 'RetractionYear',
       'MAGRetractionYearAffID', 'MAGRetractionYearAffRank',
       'MAGRetractionYearAffYear', 'MatchMAGRetractionYearAffID',
       'MatchMAGRetractionYearAffYear', 'MatchMAGRetractionYearAffRank',
       'MatchMAGMaxRetractionYear', 'MAGrootFID', 'MAGrootFIDMaxPercent',
       'MatchMAGrootFID', 'MatchMAGrootFIDMaxPercent', 'GenderizeGender',
       'MAGFirstPubYear', 'MAGFirstAffID', 'MAGFirstAffiliationRank',
       'MatchMAGFirstAffID', 'MatchMAGFirstAffYear',
       'MatchMAGFirstAffiliationRank', 'MAGCumPapersAtRetraction',
       'MAGCumPapersYearAtRetraction', 'MatchMAGCumPapersYearAtRetraction',
       'MatchMAGCumPapersAtRetraction', 'MAGCumCitationsAtRetraction',
       'MAGCumCitationsYearAtRetraction',
       'MatchMAGCumCitationsYearAtRetraction',
       'MatchMAGCumCitationsAtRetraction', 'MAGCumCollaboratorsAtRetraction',
       'MAGCumCollaboratorsYearAtRetraction',
       'MatchMAGCumCollaborat

In [25]:
# Creating treatment and control

df_treatment = dfmatched_30_perc_wStrataVars.drop(columns=['MatchMAGAID','MatchMAGRetractionYearAffID',
               'MatchMAGRetractionYearAffYear', 'MatchMAGRetractionYearAffRank',
               'MatchMAGMaxRetractionYear','MatchMAGrootFID', 'MatchMAGrootFIDMaxPercent',
                'MatchMAGCumPapersYearAtRetraction', 'MatchMAGCumPapersAtRetraction',
               'MatchMAGCumCitationsYearAtRetraction', 'MatchMAGCumCitationsAtRetraction',
               'MatchMAGCumCollaboratorsYearAtRetraction',
               'MatchMAGCumCollaboratorsAtRetraction',
                'MatchMAGFirstAffID', 'MatchMAGFirstAffYear',
               'MatchMAGFirstAffiliationRank']).drop_duplicates()

df_control = dfmatched_30_perc_wStrataVars.copy()

In [26]:
df_treatment.columns

Index(['MAGAID', 'Record ID', 'MAGPID', 'RetractionYear',
       'MAGRetractionYearAffID', 'MAGRetractionYearAffRank',
       'MAGRetractionYearAffYear', 'MAGrootFID', 'MAGrootFIDMaxPercent',
       'GenderizeGender', 'MAGFirstPubYear', 'MAGFirstAffID',
       'MAGFirstAffiliationRank', 'MAGCumPapersAtRetraction',
       'MAGCumPapersYearAtRetraction', 'MAGCumCitationsAtRetraction',
       'MAGCumCitationsYearAtRetraction', 'MAGCumCollaboratorsAtRetraction',
       'MAGCumCollaboratorsYearAtRetraction',
       'MAGRetractionYearAffRankOrdinal', 'ReasonPropagatedMajorityOfMajority',
       'MAGAIDFirstORLastAuthorFlag', 'RetractorMajority',
       'AcademicAgeBeforeRetraction', 'AcademicAgeBin', 'MAGFieldName',
       'MAGRetractionYearAffRankStratified'],
      dtype='object')

In [27]:
# Reading the collaborators file
df_1d_collaborators = pd.read_csv(indir+"RW_MAGcollaborators_1stDegree_rematching_woPapersCitationsCollaboratorsAtRetraction_wCollabYear_le2020_30perc.csv")
df_1d_collaborators

Unnamed: 0,MAGAID,ScientistType,MAGCollaborationYear,MAGCollabAID,FirstYearPostRetraction,YearOfAttrition,RetractionYear,AuthorType,YearsBetweenRyearAndFirstActivityPostRetraction
0,2.105038e+09,retracted,1983.0,2004120834,1995.0,2000.0,1994.0,retracted,1.0
1,2.105038e+09,retracted,1983.0,2124401064,1995.0,2000.0,1994.0,retracted,1.0
2,2.105038e+09,retracted,1983.0,2486043001,1995.0,2000.0,1994.0,retracted,1.0
3,2.105038e+09,retracted,1992.0,2124401064,1995.0,2000.0,1994.0,retracted,1.0
4,2.105038e+09,retracted,1992.0,2276877851,1995.0,2000.0,1994.0,retracted,1.0
...,...,...,...,...,...,...,...,...,...
958015,2.294600e+09,matched,2020.0,2972696792,2013.0,2020.0,2012.0,matched,1.0
958016,2.294600e+09,matched,2020.0,3111842016,2013.0,2020.0,2012.0,matched,1.0
958017,2.294600e+09,matched,2020.0,3112134165,2013.0,2020.0,2012.0,matched,1.0
958018,2.294600e+09,matched,2020.0,3112412017,2013.0,2020.0,2012.0,matched,1.0


In [33]:
# Separating collaborators for treatment and control
df_1d_collaborators_treatment = df_1d_collaborators[df_1d_collaborators.ScientistType == 'retracted'].\
                                drop(columns=['ScientistType']).\
                                drop_duplicates() # Not necessary but still

df_1d_collaborators_control = df_1d_collaborators[df_1d_collaborators.ScientistType == 'matched'].\
                                rename(columns={'MAGAID':'MatchMAGAID'}).\
                                drop(columns=['ScientistType']).\
                                drop_duplicates() # Not necessary but still


# Let us only get collaborators for MAGAIDs that are relevant
df_1d_collaborators_treatment = df_1d_collaborators_treatment.\
                                merge(df_treatment[['MAGAID']].drop_duplicates(),
                                on='MAGAID', how='right')

# Now let us augment df_1d_collaborators_control with MAGAID first
# Also only getting collaborators for matches that are useful
df_1d_collaborators_control = df_1d_collaborators_control.\
                                merge(df_control[['MAGAID','MatchMAGAID']].drop_duplicates(),
                                on=['MatchMAGAID'], how='right')

df_1d_collaborators_control.shape

(445250, 9)

In [34]:
df_1d_collaborators_treatment[df_1d_collaborators_treatment.MAGCollaborationYear.isna()]

Unnamed: 0,MAGAID,MAGCollaborationYear,MAGCollabAID,FirstYearPostRetraction,YearOfAttrition,RetractionYear,AuthorType,YearsBetweenRyearAndFirstActivityPostRetraction


In [35]:
df_1d_collaborators_treatment.MAGAID.nunique(),df_1d_collaborators_control.MAGAID.nunique(),df_1d_collaborators_control.MatchMAGAID.nunique()


(2348, 2348, 3881)

### Extracting pre- and post-retraction collaborators with 5 year window

Given the assumption that retraction affects a scientist's reputation for only a certain number of years, after which there is a phase of recovery, we limit collaborations to a 5-year window: we only look at collaborators from the 5 years before and the 5 years after retraction.

**VERY important note: earlier steps may be dropping authors with no collaborators pre- and post-retraction. We must add them back.**
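
One way authors can drop out is that `groupby` discards rows whose grouping key is NaN by default. The toy sketch below shows an author with no collaboration rows disappearing after a group-by and then being restored with a left merge. The frames `all_authors` and `collabs` and the column `n_collaborators` are hypothetical and only illustrate the idea.

```python
import pandas as pd

all_authors = pd.DataFrame({"MAGAID": [1, 2, 3]})

# Author 3 has no collaboration rows, so the right merge leaves NaNs for them
collabs = pd.DataFrame({"MAGAID": [1, 1, 2],
                        "RetractionYear": [2010, 2010, 2012],
                        "MAGCollabAID": [10, 11, 20]})
merged = collabs.merge(all_authors, on="MAGAID", how="right")

# groupby drops NaN keys by default, so author 3 disappears here
counts = (merged.groupby(["MAGAID", "RetractionYear"])["MAGCollabAID"]
                .nunique()
                .reset_index(name="n_collaborators"))

# Add the missing authors back with a count of zero
restored = all_authors.merge(counts, on="MAGAID", how="left")
restored["n_collaborators"] = restored["n_collaborators"].fillna(0).astype(int)
print(restored)
```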

In [36]:
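# Let us first create a pre/post flag that marks whether a collaboration happened
# before or after retraction, given a 5-year post-retraction window

def get_prepost_flag(row):
    # Rows without a collaboration year (authors with no collaborators) default to 'pre'
    if pd.isna(row['MAGCollaborationYear']):
        return 'pre'
    # The retraction year itself counts as 'pre'
    if row['MAGCollaborationYear'] <= row['RetractionYear']:
        return 'pre'
    # Within 5 years after retraction
    if (row['MAGCollaborationYear'] - row['RetractionYear']) <= 5:
        return 'post5'
    # More than 5 years after retraction
    return 'post'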


# Now let us apply the get_prepost_flag function to each row for treatment and control

df_1d_collaborators_treatment['PrePostFlag5'] = df_1d_collaborators_treatment.apply(lambda row: get_prepost_flag(row), 
                                                 axis=1)

df_1d_collaborators_control['PrePostFlag5'] = df_1d_collaborators_control.apply(lambda row: get_prepost_flag(row), 
                                             axis=1)
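
Applying `get_prepost_flag` row by row can be slow on a table with hundreds of thousands of rows. The sketch below shows a vectorized version on toy data using `np.select`. It is meant to reproduce the same three labels but is an illustration, not the notebook's code.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"MAGCollaborationYear": [2005, 2010, 2013, 2015, 2016, np.nan],
                    "RetractionYear":       [2010, 2010, 2010, 2010, 2010, 2010]})

years_after = toy["MAGCollaborationYear"] - toy["RetractionYear"]

# Conditions are checked in order; the first one that holds decides the label
conditions = [
    toy["MAGCollaborationYear"].isna() | (years_after <= 0),  # missing, or not after retraction
    years_after <= 5,                                          # 1 to 5 years after retraction
]
toy["PrePostFlag5"] = np.select(conditions, ["pre", "post5"], default="post")
print(toy)
```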

In [640]:
# Now we must first impute NaNs with 

In [37]:
# Now let us extract pre- and post-retraction collaborators as sets

# Grouping by MAGAID, RetractionYear, and pre-post flag to extract pre- and post-retraction collaborators.
df_1d_collaborators_treatment_w5 = df_1d_collaborators_treatment.groupby(['MAGAID',
                                          'RetractionYear','PrePostFlag5'])\
                                        ['MAGCollabAID'].apply(set).unstack().reset_index().\
                                        drop(columns=['post'])

df_1d_collaborators_control_w5 = df_1d_collaborators_control.groupby(['MAGAID','MatchMAGAID','RetractionYear',
                                                                   'PrePostFlag5'])\
                                        ['MAGCollabAID'].apply(set).unstack().reset_index().\
                                        drop(columns=['post'])


In [38]:
# df_1d_collaborators_control_w5[df_1d_collaborators_control_w5.post5.isna() & 
#                               df_1d_collaborators_control_w5.MatchMAGAID.isin(temp)]

In [39]:
df_1d_collaborators_control_w5

PrePostFlag5,MAGAID,MatchMAGAID,RetractionYear,post5,pre
0,2184860,2.136872e+09,2006.0,"{2142233216, 2138036098, 3171196163, 212547943...","{2938253029, 2601764421, 2128606473, 196349863..."
1,2184860,2.628313e+09,2008.0,"{2118927203, 2159872740, 2653548035, 298556306...","{2100084742, 2275291657, 2143456268, 263429415..."
2,8197726,1.574644e+09,2009.0,"{2167392961, 1424858499, 2103543300, 195557334...","{2167392961, 1955573347, 2334728430, 2107135623}"
3,9474215,2.169122e+09,2002.0,"{3025225222, 2300327943, 2548226058, 277219534...","{2327163137, 822241539, 2616293507, 2576967813..."
4,13737004,2.311431e+09,2014.0,"{2026366049, 2103074534, 160984200, 1973807465...","{2675909761, 2054385801, 2109105035, 212543528..."
...,...,...,...,...,...
4049,3173543754,2.947857e+09,2007.0,"{3057326336, 2157021700, 2607911560, 210492045...","{2108270886, 3072270185, 2500904425, 277850839..."
4050,3174447547,1.897294e+09,2008.0,{2074793029},"{2594844899, 2074793029, 2067866536, 266878635..."
4051,3174844467,2.315520e+09,2014.0,"{2310608896, 2991788804, 2948262277, 205797632...","{2674099783, 2164723530, 621341675, 2297775246..."
4052,3175435814,2.137476e+09,2010.0,"{2006164512, 2196434115, 2149740204, 213981022...","{2170387459, 1823443086, 2120184847, 298839937..."


In [40]:

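# Dealing with NaNs: an author with no collaborators in a window has NaN instead of a set,
# so we replace NaNs with empty sets to keep the set operations below well defined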

# For treatment
df_1d_collaborators_treatment_w5['pre'] = df_1d_collaborators_treatment_w5['pre'].\
                                            apply(lambda d: d if isinstance(d, set) else set())

df_1d_collaborators_treatment_w5['post5'] = df_1d_collaborators_treatment_w5['post5'].\
                                                apply(lambda d: d if isinstance(d, set) else set())

# For control
df_1d_collaborators_control_w5['pre'] = df_1d_collaborators_control_w5['pre'].\
                                            apply(lambda d: d if isinstance(d, set) else set())
df_1d_collaborators_control_w5['post5'] = df_1d_collaborators_control_w5['post5'].\
                                            apply(lambda d: d if isinstance(d, set) else set())
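
As a small illustration of the imputation above, the sketch below applies the same `isinstance` check to a hypothetical column that mixes real sets with a NaN. The series `post5` and its values are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing real sets with a NaN left behind by unstack()
post5 = pd.Series([{11, 12}, np.nan, {20}], dtype=object)

# Keep real sets; replace anything else (here: NaN) with a fresh empty set
post5_clean = post5.apply(lambda d: d if isinstance(d, set) else set())

print(post5_clean.apply(len).tolist())
```

Every entry is now a set, so `intersection` and set difference can be applied row by row without special-casing missing values.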

In [41]:
df_1d_collaborators_control_w5

PrePostFlag5,MAGAID,MatchMAGAID,RetractionYear,post5,pre
0,2184860,2.136872e+09,2006.0,"{2142233216, 2138036098, 3171196163, 212547943...","{2938253029, 2601764421, 2128606473, 196349863..."
1,2184860,2.628313e+09,2008.0,"{2118927203, 2159872740, 2653548035, 298556306...","{2100084742, 2275291657, 2143456268, 263429415..."
2,8197726,1.574644e+09,2009.0,"{2167392961, 1424858499, 2103543300, 195557334...","{2167392961, 1955573347, 2334728430, 2107135623}"
3,9474215,2.169122e+09,2002.0,"{3025225222, 2300327943, 2548226058, 277219534...","{2327163137, 822241539, 2616293507, 2576967813..."
4,13737004,2.311431e+09,2014.0,"{2026366049, 2103074534, 160984200, 1973807465...","{2675909761, 2054385801, 2109105035, 212543528..."
...,...,...,...,...,...
4049,3173543754,2.947857e+09,2007.0,"{3057326336, 2157021700, 2607911560, 210492045...","{2108270886, 3072270185, 2500904425, 277850839..."
4050,3174447547,1.897294e+09,2008.0,{2074793029},"{2594844899, 2074793029, 2067866536, 266878635..."
4051,3174844467,2.315520e+09,2014.0,"{2310608896, 2991788804, 2948262277, 205797632...","{2674099783, 2164723530, 621341675, 2297775246..."
4052,3175435814,2.137476e+09,2010.0,"{2006164512, 2196434115, 2149740204, 213981022...","{2170387459, 1823443086, 2120184847, 298839937..."


### Extracting number & set of retained collaborators with a 5-year window

In [42]:
# Now let us find the number and set of retained collaborators for both groups
# (retained = present in both the pre-retraction set and the 5-year post-retraction set)

df_1d_collaborators_treatment_w5['NumRetentionW5'] = df_1d_collaborators_treatment_w5.apply(lambda row: \
                                                    len(row.post5.intersection(row.pre)), 
                                                    axis=1)

df_1d_collaborators_treatment_w5['CollabAIDRetainedW5'] = df_1d_collaborators_treatment_w5.apply(lambda row: \
                                                    row.post5.intersection(row.pre), 
                                                    axis=1)

df_1d_collaborators_control_w5['NumRetentionW5'] = df_1d_collaborators_control_w5.apply(lambda row: \
                                                len(row.post5.intersection(row.pre)), 
                                                axis=1)

df_1d_collaborators_control_w5['CollabAIDRetainedW5'] = df_1d_collaborators_control_w5.apply(lambda row: \
                                                    row.post5.intersection(row.pre), 
                                                    axis=1)
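
The retention logic can be checked on a toy frame. The sketch below is one possible variant, not the notebook's exact code: it computes the intersection once per row and derives the count from it. All IDs are hypothetical.

```python
import pandas as pd

# Toy pre/post collaborator sets for two hypothetical authors
toy = pd.DataFrame({
    "MAGAID": [1, 2],
    "pre":    [{10, 11, 12}, {20, 21}],
    "post5":  [{11, 12, 13}, set()],
})

# Retained collaborators: present both before and within 5 years after retraction
toy["CollabAIDRetainedW5"] = [post & pre for post, pre in zip(toy["post5"], toy["pre"])]

# The count follows directly from the set, so the intersection is computed only once
toy["NumRetentionW5"] = toy["CollabAIDRetainedW5"].apply(len)

print(toy[["MAGAID", "NumRetentionW5", "CollabAIDRetainedW5"]])
```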

In [43]:
df_1d_collaborators_treatment_w5

PrePostFlag5,MAGAID,RetractionYear,post5,pre,NumRetentionW5,CollabAIDRetainedW5
0,2.184860e+06,2008.0,"{1749937409, 2149399999, 2617454923, 260192180...","{2118754826, 2601921804, 2665868049, 230519298...",6,"{2617454923, 2601921804, 2032600174, 230576087..."
1,8.197726e+06,2012.0,"{1969204096, 2226225926, 2024673639, 199363863...","{2024673639, 279977545, 1970425162, 1689175372...",4,"{1970425162, 1689175372, 2122225613, 2024673639}"
2,9.474215e+06,2015.0,"{2741493764, 1134876678, 2589376522, 288503604...","{2102010368, 2147699201, 2402001923, 277873818...",55,"{2102010368, 2147699201, 2402001923, 211700557..."
3,1.373700e+07,2014.0,"{2144558976, 2068504066, 1986642243, 146813190...","{2144558976, 1456934145, 2050002051, 146813190...",8,"{2144558976, 1986642243, 146813190, 2304798023..."
4,1.551904e+07,2013.0,"{2716086216, 2104789297, 67380540, 2546151959}","{2648287456, 2318811809, 2332026499, 242479555...",0,{}
...,...,...,...,...,...,...
2343,3.173544e+09,2011.0,"{3120802076, 2236143686}","{2096029443, 2160112133, 2158121610, 230907623...",0,{}
2344,3.174448e+09,2008.0,{2561941943},"{2413204075, 2100828844, 3175667245, 299211752...",1,{2561941943}
2345,3.174844e+09,2014.0,"{2636262617, 550125002, 2295148299, 2265510894...","{2954067065, 2311908582, 2005715177, 550125002...",3,"{2174600848, 2636262617, 550125002}"
2346,3.175436e+09,2015.0,"{1805786912, 2999619457, 2658197410, 257933920...","{2130470407, 2395301650, 1455333013, 231293572...",1,{2121913688}


In [44]:
df_1d_collaborators_control_w5

PrePostFlag5,MAGAID,MatchMAGAID,RetractionYear,post5,pre,NumRetentionW5,CollabAIDRetainedW5
0,2184860,2.136872e+09,2006.0,"{2142233216, 2138036098, 3171196163, 212547943...","{2938253029, 2601764421, 2128606473, 196349863...",4,"{2128606473, 2136984498, 1963498635, 1600873685}"
1,2184860,2.628313e+09,2008.0,"{2118927203, 2159872740, 2653548035, 298556306...","{2100084742, 2275291657, 2143456268, 263429415...",8,"{2118927203, 2159872740, 2275291657, 268779959..."
2,8197726,1.574644e+09,2009.0,"{2167392961, 1424858499, 2103543300, 195557334...","{2167392961, 1955573347, 2334728430, 2107135623}",3,"{2167392961, 1955573347, 2107135623}"
3,9474215,2.169122e+09,2002.0,"{3025225222, 2300327943, 2548226058, 277219534...","{2327163137, 822241539, 2616293507, 2576967813...",26,"{822241539, 2154252681, 2517321866, 2772195341..."
4,13737004,2.311431e+09,2014.0,"{2026366049, 2103074534, 160984200, 1973807465...","{2675909761, 2054385801, 2109105035, 212543528...",1,{152283727}
...,...,...,...,...,...,...,...
4049,3173543754,2.947857e+09,2007.0,"{3057326336, 2157021700, 2607911560, 210492045...","{2108270886, 3072270185, 2500904425, 277850839...",2,"{2138560915, 2123453852}"
4050,3174447547,1.897294e+09,2008.0,{2074793029},"{2594844899, 2074793029, 2067866536, 266878635...",1,{2074793029}
4051,3174844467,2.315520e+09,2014.0,"{2310608896, 2991788804, 2948262277, 205797632...","{2674099783, 2164723530, 621341675, 2297775246...",6,"{2674099783, 2297775246, 2133515217, 279778914..."
4052,3175435814,2.137476e+09,2010.0,"{2006164512, 2196434115, 2149740204, 213981022...","{2170387459, 1823443086, 2120184847, 298839937...",6,"{2006164512, 2149740204, 1823443086, 257226533..."


In [45]:
# df_1d_collaborators_control_w5[df_1d_collaborators_control_w5.MatchMAGAID.isin(temp)].NumRetentionW5.value_counts()

### Extracting number & set of new collaborators with a 5-year window

In [46]:
# Now let us compute the number of new collaborators

# We can compute them by subtracting pre-retraction collaborators' set from post-retraction collaborators' set
def extract_num_newCollab(row):
    return len(row['post5']-row['pre'])

def extract_newCollab(row):
    return row['post5']-row['pre']

# computing number and set of new collaborators
df_1d_collaborators_treatment_w5['NumNewCollaboratorsW5'] = df_1d_collaborators_treatment_w5\
                                                            .apply(lambda row: extract_num_newCollab(row), 
                                                               axis=1)

df_1d_collaborators_treatment_w5['CollabAIDGainedW5'] = df_1d_collaborators_treatment_w5\
                                                            .apply(lambda row: extract_newCollab(row), 
                                                               axis=1)

df_1d_collaborators_control_w5['NumNewCollaboratorsW5'] = df_1d_collaborators_control_w5.apply(lambda row: extract_num_newCollab(row), 
                                                       axis=1)

df_1d_collaborators_control_w5['CollabAIDGainedW5'] = df_1d_collaborators_control_w5.apply(lambda row: extract_newCollab(row), 
                                                           axis=1)
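
A quick consistency check that could be run at this point: every post-retraction collaborator is either retained (also in `pre`) or new (not in `pre`), so the two counts should add up to the size of `post5`. The sketch below demonstrates this on made-up sets; it is a suggested sanity check, not part of the original pipeline.

```python
import pandas as pd

# Toy pre/post collaborator sets (hypothetical IDs)
toy = pd.DataFrame({
    "pre":   [{10, 11, 12}, {20, 21}, set()],
    "post5": [{11, 12, 13, 14}, set(), {30}],
})

pairs = list(zip(toy["post5"], toy["pre"]))
toy["NumRetentionW5"] = [len(post & pre) for post, pre in pairs]         # in both sets
toy["NumNewCollaboratorsW5"] = [len(post - pre) for post, pre in pairs]  # only in post5

# Retained + gained must equal the number of post-retraction collaborators
post_sizes = toy["post5"].apply(len)
assert ((toy["NumRetentionW5"] + toy["NumNewCollaboratorsW5"]) == post_sizes).all()
```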



In [47]:
df_1d_collaborators_treatment_w5.head(2)

PrePostFlag5,MAGAID,RetractionYear,post5,pre,NumRetentionW5,CollabAIDRetainedW5,NumNewCollaboratorsW5,CollabAIDGainedW5
0,2184860.0,2008.0,"{1749937409, 2149399999, 2617454923, 260192180...","{2118754826, 2601921804, 2665868049, 230519298...",6,"{2617454923, 2601921804, 2032600174, 230576087...",5,"{1749937409, 1517361100, 2798255860, 251012216..."
1,8197726.0,2012.0,"{1969204096, 2226225926, 2024673639, 199363863...","{2024673639, 279977545, 1970425162, 1689175372...",4,"{1970425162, 1689175372, 2122225613, 2024673639}",10,"{1969204096, 2226225926, 1993638631, 256742932..."


In [48]:
df_1d_collaborators_control_w5.head(2)

PrePostFlag5,MAGAID,MatchMAGAID,RetractionYear,post5,pre,NumRetentionW5,CollabAIDRetainedW5,NumNewCollaboratorsW5,CollabAIDGainedW5
0,2184860,2136872000.0,2006.0,"{2142233216, 2138036098, 3171196163, 212547943...","{2938253029, 2601764421, 2128606473, 196349863...",4,"{2128606473, 2136984498, 1963498635, 1600873685}",34,"{2142233216, 2138036098, 3171196163, 212547943..."
1,2184860,2628313000.0,2008.0,"{2118927203, 2159872740, 2653548035, 298556306...","{2100084742, 2275291657, 2143456268, 263429415...",8,"{2118927203, 2159872740, 2275291657, 268779959...",10,"{2653548035, 1989641415, 2717105898, 249522852..."


### Merging the number of retained and new collaborators with treatment and control

In [49]:

df_treatment_augmented = df_treatment.\
                merge(df_1d_collaborators_treatment_w5[['MAGAID','NumRetentionW5',
                                                        'NumNewCollaboratorsW5']].drop_duplicates(),
                     on='MAGAID')

df_treatment_augmented

Unnamed: 0,MAGAID,Record ID,MAGPID,RetractionYear,MAGRetractionYearAffID,MAGRetractionYearAffRank,MAGRetractionYearAffYear,MAGrootFID,MAGrootFIDMaxPercent,GenderizeGender,...,MAGRetractionYearAffRankOrdinal,ReasonPropagatedMajorityOfMajority,MAGAIDFirstORLastAuthorFlag,RetractorMajority,AcademicAgeBeforeRetraction,AcademicAgeBin,MAGFieldName,MAGRetractionYearAffRankStratified,NumRetentionW5,NumNewCollaboratorsW5
0,2184860,15835.0,2.609888e+09,2008.0,861853513.0,151-200,2008.0,86803240.0,0.375000,female,...,175.0,other,MAGMiddleAuthor,other retractor,2.0,2,biology,101-500,6,5
1,8197726,3444.0,1.506358e+08,2012.0,13955877.0,401-500,2012.0,86803240.0,0.333333,female,...,450.0,mistake,MAGMiddleAuthor,author,5.0,3-5,biology,101-500,4,10
2,8197726,3444.0,1.506358e+08,2012.0,13955877.0,401-500,2012.0,185592680.0,0.333333,female,...,450.0,mistake,MAGMiddleAuthor,author,5.0,3-5,chemistry,101-500,4,10
3,9474215,7285.0,1.985944e+09,2015.0,79576946.0,17,2003.0,86803240.0,0.340102,male,...,17.0,mistake,MAGMiddleAuthor,author,22.0,>5,biology,1-100,55,481
4,13737004,3344.0,2.035632e+09,2014.0,186903577.0,701-800,2013.0,86803240.0,0.320000,male,...,750.0,mistake,MAGMiddleAuthor,other retractor,10.0,>5,biology,501-1000,8,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2803,2152217456,8166.0,2.002880e+09,2014.0,154851008.0,901-1000,2014.0,127313418.0,0.220779,male,...,950.0,other,MAGFirstOrLastAuthor,journal,37.0,>5,non-STEM fields,501-1000,6,188
2804,2160882320,16554.0,2.014362e+09,2014.0,9360294.0,201-300,2014.0,127313418.0,0.333333,male,...,250.0,misconduct,MAGMiddleAuthor,other retractor,6.0,>5,non-STEM fields,101-500,8,12
2805,2162517952,8543.0,2.121510e+09,2013.0,66906201.0,901-1000,2012.0,127313418.0,0.230769,female,...,950.0,other,MAGMiddleAuthor,other retractor,3.0,3-5,non-STEM fields,501-1000,5,13
2806,2195113989,18557.0,2.045810e+09,2008.0,298625061.0,301-400,2008.0,127313418.0,0.285714,male,...,350.0,misconduct,MAGFirstOrLastAuthor,journal,3.0,3-5,non-STEM fields,101-500,1,4
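
Note that an author can appear in several rows of the treatment frame (for example, MAGAID 8197726 above is listed under both biology and chemistry), so the merge is many-to-one. One way to guard against accidental row duplication is the `validate` argument of `DataFrame.merge`. The sketch below uses invented frames `left` and `right` to show the idea.

```python
import pandas as pd

# Left side: one row per (author, field); an author may appear more than once
left = pd.DataFrame({
    "MAGAID": [1, 1, 2],
    "MAGFieldName": ["biology", "chemistry", "biology"],
})

# Right side: outcome counts, expected to be unique per author
right = pd.DataFrame({
    "MAGAID": [1, 2],
    "NumRetentionW5": [4, 6],
    "NumNewCollaboratorsW5": [10, 5],
})

# validate="many_to_one" raises if MAGAID is not unique on the right side
merged = left.merge(right, on="MAGAID", validate="many_to_one")

# A duplicated key on the right side is caught instead of silently inflating rows
dup_right = pd.concat([right, right.iloc[[0]].assign(NumRetentionW5=99)])
try:
    left.merge(dup_right, on="MAGAID", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(len(merged), raised)
```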


In [50]:
df_control_augmented = df_control.\
                merge(df_1d_collaborators_control_w5[['MAGAID','MatchMAGAID','NumRetentionW5',
                                                        'NumNewCollaboratorsW5']].drop_duplicates(),
                     on=['MAGAID','MatchMAGAID'])

df_control_augmented

Unnamed: 0,MAGAID,MatchMAGAID,Record ID,MAGPID,RetractionYear,MAGRetractionYearAffID,MAGRetractionYearAffRank,MAGRetractionYearAffYear,MatchMAGRetractionYearAffID,MatchMAGRetractionYearAffYear,...,MAGRetractionYearAffRankOrdinal,ReasonPropagatedMajorityOfMajority,MAGAIDFirstORLastAuthorFlag,RetractorMajority,AcademicAgeBeforeRetraction,AcademicAgeBin,MAGFieldName,MAGRetractionYearAffRankStratified,NumRetentionW5,NumNewCollaboratorsW5
0,2184860,2.628313e+09,15835.0,2.609888e+09,2008.0,861853513.0,151-200,2008.0,881766915.0,2008.0,...,175.0,other,MAGMiddleAuthor,other retractor,2.0,2,biology,101-500,8,10
1,2184860,2.136872e+09,15835.0,2.609888e+09,2008.0,861853513.0,151-200,2008.0,205349734.0,2008.0,...,175.0,other,MAGMiddleAuthor,other retractor,2.0,2,biology,101-500,4,34
2,8197726,1.574644e+09,3444.0,1.506358e+08,2012.0,13955877.0,401-500,2012.0,185443292.0,2011.0,...,450.0,mistake,MAGMiddleAuthor,author,5.0,3-5,biology,101-500,3,11
3,8197726,1.574644e+09,3444.0,1.506358e+08,2012.0,13955877.0,401-500,2012.0,185443292.0,2011.0,...,450.0,mistake,MAGMiddleAuthor,author,5.0,3-5,chemistry,101-500,3,11
4,9474215,2.169122e+09,7285.0,1.985944e+09,2015.0,79576946.0,17,2003.0,79576946.0,1999.0,...,17.0,mistake,MAGMiddleAuthor,author,22.0,>5,biology,1-100,26,71
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5413,2714015866,2.110080e+09,7423.0,2.147214e+09,2008.0,129774422.0,151-200,2008.0,71999127.0,2008.0,...,175.0,plagiarism,MAGMiddleAuthor,other retractor,0.0,1,non-STEM fields,101-500,4,4
5414,2714015866,2.110080e+09,7423.0,2.147214e+09,2008.0,129774422.0,151-200,2008.0,71999127.0,2008.0,...,175.0,plagiarism,MAGMiddleAuthor,other retractor,0.0,1,non-STEM fields,101-500,4,4
5415,2714015866,3.168352e+09,7423.0,2.147214e+09,2008.0,129774422.0,151-200,2008.0,204983213.0,2008.0,...,175.0,plagiarism,MAGMiddleAuthor,other retractor,0.0,1,non-STEM fields,101-500,2,3
5416,2947073985,3.019838e+08,16554.0,2.014362e+09,2014.0,32021983.0,201-300,2014.0,138689650.0,2014.0,...,250.0,misconduct,MAGFirstOrLastAuthor,other retractor,21.0,>5,non-STEM fields,101-500,41,141


In [51]:
# df_1d_collaborators_control_w5[df_1d_collaborators_control_w5.MatchMAGAID.isin(temp)].NumNewCollaboratorsW5.value_counts()

### Triadic Closure

In [52]:
# Let us first read the triadic closure files (every file in the directory) and stack them
import os

triadic_closure_path = '/Users/sm9654/desktop/NYUAD/nyuad-research/retraction_effects_on_collaboration_networks/data/main/triadicClosure/rematching/nonattrited_matches/'

flist = os.listdir(triadic_closure_path)

dfs = []
for f in flist:
    # os.path.join builds the path correctly whether or not the directory ends with a slash
    df = pd.read_csv(os.path.join(triadic_closure_path, f),
                     usecols=['MAGAID','NumOpenTriads','RetractionYear','NumTriadsClosed','NC'])
    dfs.append(df)

df_triads = pd.concat(dfs)

df_triads.head()


Unnamed: 0,MAGAID,RetractionYear,NumOpenTriads,NumTriadsClosed,NC
0,2642974000.0,2014,126,0,0.0
1,2643290000.0,2015,66,0,0.0
2,2643319000.0,2013,262,0,0.0
3,2643339000.0,1996,462,0,0.0
4,2643419000.0,2015,1,0,0.0
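
For reference, a self-contained sketch of the same read-and-stack pattern on temporary files is given below. The file names, the extra column, and the `.csv` filter are assumptions made for the example; the filter simply keeps stray files such as `.DS_Store` out of `read_csv`.

```python
import os
import tempfile

import pandas as pd

cols = ["MAGAID", "NumOpenTriads", "RetractionYear", "NumTriadsClosed", "NC"]

with tempfile.TemporaryDirectory() as tmpdir:
    # Two hypothetical part-files with one column we do not need
    pd.DataFrame({
        "MAGAID": [1, 2], "RetractionYear": [2010, 2011],
        "NumOpenTriads": [4, 0], "NumTriadsClosed": [1, 0],
        "NC": [0.25, 0.0], "Extra": ["a", "b"],
    }).to_csv(os.path.join(tmpdir, "part0.csv"), index=False)
    pd.DataFrame({
        "MAGAID": [3], "RetractionYear": [2012],
        "NumOpenTriads": [7], "NumTriadsClosed": [2],
        "NC": [2 / 7], "Extra": ["c"],
    }).to_csv(os.path.join(tmpdir, "part1.csv"), index=False)

    # A stray non-CSV file that should be skipped
    open(os.path.join(tmpdir, ".DS_Store"), "w").close()

    csv_files = sorted(f for f in os.listdir(tmpdir) if f.endswith(".csv"))
    toy_triads = pd.concat(
        [pd.read_csv(os.path.join(tmpdir, f), usecols=cols) for f in csv_files],
        ignore_index=True,  # avoid repeated index values across part-files
    )

print(toy_triads)
```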



In [53]:
df_triads.MAGAID.nunique() # number of unique authors with triad information; includes matched controls as well as retracted authors

7783

In [54]:
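# Left-merge the triad outcomes so that authors without triad information are kept
df_treatment_augmented = df_treatment_augmented.merge(df_triads, on='MAGAID', how='left')
# Number of retracted authors with no triad information
print(df_treatment_augmented[df_treatment_augmented.NC.isna()].MAGAID.nunique())
# Missing NC is imputed with 0; NumOpenTriads and NumTriadsClosed are left as NaN
df_treatment_augmented['NC'] = df_treatment_augmented['NC'].fillna(0)


df_control_augmented = df_control_augmented.merge(df_triads.rename(columns={'MAGAID':'MatchMAGAID'}), 
                                                  on=['MatchMAGAID','RetractionYear'], how='left')

# Number of matched controls with no triad information
print(df_control_augmented[df_control_augmented.NC.isna()].MatchMAGAID.nunique())

df_control_augmented['NC'] = df_control_augmented['NC'].fillna(0)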

350
2363
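
Since a sizeable number of authors have no triad information, it may be worth keeping track of which zeros are imputed. The sketch below is one hedged way to do that with the `indicator` option of `DataFrame.merge`; the frames and the `NCMissing` column are hypothetical and not part of the saved files.

```python
import pandas as pd

# Hypothetical authors and triad outcomes; author 2 has no triad record
authors = pd.DataFrame({"MAGAID": [1, 2, 3], "RetractionYear": [2010, 2011, 2012]})
triads = pd.DataFrame({
    "MAGAID": [1, 3],
    "RetractionYear": [2010, 2012],
    "NumOpenTriads": [7, 35],
    "NumTriadsClosed": [2, 0],
    "NC": [2 / 7, 0.0],
})

# indicator=True adds a '_merge' column telling where each row came from
merged = authors.merge(triads, on=["MAGAID", "RetractionYear"],
                       how="left", indicator=True)

# Authors present only on the left side have no triad information
n_missing = merged.loc[merged["_merge"] == "left_only", "MAGAID"].nunique()

# Flag imputed values before filling, so true zeros stay distinguishable
merged["NCMissing"] = merged["NC"].isna()
merged["NC"] = merged["NC"].fillna(0)

print(n_missing)
```

Here author 3 has a genuine NC of 0 (open triads, none closed), while author 2 only receives 0 through imputation; the flag keeps the two cases apart.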


In [55]:
df_control_augmented

Unnamed: 0,MAGAID,MatchMAGAID,Record ID,MAGPID,RetractionYear,MAGRetractionYearAffID,MAGRetractionYearAffRank,MAGRetractionYearAffYear,MatchMAGRetractionYearAffID,MatchMAGRetractionYearAffYear,...,RetractorMajority,AcademicAgeBeforeRetraction,AcademicAgeBin,MAGFieldName,MAGRetractionYearAffRankStratified,NumRetentionW5,NumNewCollaboratorsW5,NumOpenTriads,NumTriadsClosed,NC
0,2184860,2.628313e+09,15835.0,2.609888e+09,2008.0,861853513.0,151-200,2008.0,881766915.0,2008.0,...,other retractor,2.0,2,biology,101-500,8,10,,,0.000000
1,2184860,2.136872e+09,15835.0,2.609888e+09,2008.0,861853513.0,151-200,2008.0,205349734.0,2008.0,...,other retractor,2.0,2,biology,101-500,4,34,,,0.000000
2,8197726,1.574644e+09,3444.0,1.506358e+08,2012.0,13955877.0,401-500,2012.0,185443292.0,2011.0,...,author,5.0,3-5,biology,101-500,3,11,,,0.000000
3,8197726,1.574644e+09,3444.0,1.506358e+08,2012.0,13955877.0,401-500,2012.0,185443292.0,2011.0,...,author,5.0,3-5,chemistry,101-500,3,11,,,0.000000
4,9474215,2.169122e+09,7285.0,1.985944e+09,2015.0,79576946.0,17,2003.0,79576946.0,1999.0,...,author,22.0,>5,biology,1-100,26,71,,,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5413,2714015866,2.110080e+09,7423.0,2.147214e+09,2008.0,129774422.0,151-200,2008.0,71999127.0,2008.0,...,other retractor,0.0,1,non-STEM fields,101-500,4,4,35.0,0.0,0.000000
5414,2714015866,2.110080e+09,7423.0,2.147214e+09,2008.0,129774422.0,151-200,2008.0,71999127.0,2008.0,...,other retractor,0.0,1,non-STEM fields,101-500,4,4,35.0,0.0,0.000000
5415,2714015866,3.168352e+09,7423.0,2.147214e+09,2008.0,129774422.0,151-200,2008.0,204983213.0,2008.0,...,other retractor,0.0,1,non-STEM fields,101-500,2,3,7.0,2.0,0.285714
5416,2947073985,3.019838e+08,16554.0,2.014362e+09,2014.0,32021983.0,201-300,2014.0,138689650.0,2014.0,...,other retractor,21.0,>5,non-STEM fields,101-500,41,141,202.0,0.0,0.000000
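
The values above are consistent with NC being the share of open triads that were closed: the row with 7 open triads and 2 closed ones has NC = 0.285714, i.e. 2/7. The sketch below reproduces that arithmetic on toy counts. The formula is inferred from the table, and the convention of setting NC to 0 when there are no open triads is an assumption made for the example; the actual computation lives in files not shown here.

```python
import numpy as np
import pandas as pd

# Toy counts; the first three rows mirror values visible in the table above
toy = pd.DataFrame({
    "NumOpenTriads":   [7, 35, 202, 0],
    "NumTriadsClosed": [2, 0, 0, 0],
})

# Assumed definition: NC = closed / open, with NC = 0 when no triads are open
ratio = toy["NumTriadsClosed"] / toy["NumOpenTriads"].replace(0, np.nan)
toy["NC"] = np.where(toy["NumOpenTriads"] > 0, ratio, 0.0)

print(toy.round(6))
```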


In [56]:
df_treatment_augmented.to_csv(indir+"/RWMAG_rematched_treatment_augmented_rematching_30perc.csv", index=False)
df_control_augmented.to_csv(indir+"/RWMAG_rematched_control_augmented_rematching_30perc.csv", index=False)

In [57]:
df_control_augmented.MatchMAGAID.nunique()

3881

In [58]:
df_control_augmented.MAGAID.nunique()

2348

In [59]:
df_treatment_augmented.MAGAID.nunique()

2348