# Data Dropping & Augmentation: Creating Matching Analytical Samples

In this notebook, we shall augment some necessary columns within our dataframe before we conduct the matching analysis.

**First we shall drop the unnecessary columns. This is being done after Bedoor and Kinga asked me to do closest matching with distance measure (Almost right before submission to PNAS).**


Here are the steps we will take:

We shall create the matching analytical sample for paper, citation, collaborators distance of (i) 10%, (ii) 20%, and (iii) 30%. To do that the following steps will be taken:

1. First, we shall load the three matching files for 10%, 20% and 30%. 
2. Then, we shall load the confounders that were matched for treatment and control using the file: **RWMatched_intersection_wPapersCitationsCollaborators_wCollabYear_closestMatch30.csv**
3. Then we shall load the stratification variables: reason, time of retraction, order of author in the retracted paper, type of retraction, author academic age, map author affiliation rank, impute retractor majority to avoid NaNs using the file: **filtered_sample.csv**
4. Finally, we shall compute the outcome variables using the files: **RW_MAGcollaborators_1stDegree_rematching_woPapersCitationsCollaborators_wCollabYear_le2020_closestMatch30.csv** and **To be filled**. The outcome variables are
    1. Number of collaborators retained by authors and their matches
    2. Number of collaborators gained by authors and their matches
    3. Number of triads closed by authors and their matches
    4. Proportion of triads closed by authors and their matches (Newman's Coefficient or NC)


In [1]:
import pandas as pd
import numpy as np

In [2]:
# Step 1:  load the three matching files for 10%, 20% and 30%.

relevant_cols = ['MAGAID', 'MatchMAGAID', 'Record ID',]
dfmatched_10perc = pd.read_csv("..//matching/closestMatch/nonattrited_matches/closestAverageMatch_tolerance_0.1_w_0.8.csv",
                                usecols=relevant_cols).drop_duplicates()
dfmatched_20perc = pd.read_csv("..//matching/closestMatch/nonattrited_matches/closestAverageMatch_tolerance_0.2_w_0.8.csv",
                                usecols=relevant_cols).drop_duplicates()

dfmatched_30perc = pd.read_csv("..//matching/closestMatch/nonattrited_matches/closestAverageMatch_tolerance_0.3_w_0.8.csv",
                              usecols=relevant_cols).drop_duplicates()


In [3]:
dfmatched_30perc[dfmatched_30perc.MatchMAGAID.duplicated()]

Unnamed: 0,MAGAID,MatchMAGAID,Record ID
218,1.139622e+09,1.936235e+06,2224
253,1.382559e+09,1.936235e+06,4934
378,1.950968e+09,2.617699e+09,16749
890,2.094538e+09,2.027876e+09,4732
1327,2.115123e+09,2.144725e+09,4442
...,...,...,...
4182,2.798117e+09,2.568039e+09,1405
4452,3.032578e+09,2.662998e+09,17329
4453,3.032578e+09,2.525951e+09,17329
4454,3.034386e+09,2.151963e+09,4031


In [4]:
dfmatched_10perc.columns

Index(['MAGAID', 'MatchMAGAID', 'Record ID'], dtype='object')

In [5]:
dfmatched_10perc.shape, dfmatched_20perc.shape, dfmatched_30perc.shape

((2233, 3), (3631, 3), (4564, 3))

In [6]:
dfmatched_10perc.MAGAID.nunique(), dfmatched_20perc.MAGAID.nunique(), dfmatched_30perc.MAGAID.nunique()

(832, 2181, 3094)

In [7]:
# Step2: load the confounders that were matched for treatment and control

df_confounders = pd.read_csv("../../data/main/RWMatched_intersection_wPapersCitationsCollaboratorsAtRetraction_wCollabYear_closestMatch30_nonattrited_matches.csv")

df_confounders.columns
    

Index(['MAGAID', 'MatchMAGAID', 'Record ID', 'MAGPID', 'RetractionYear',
       'MAGRetractionYearAffID', 'MAGRetractionYearAffYear',
       'MAGRetractionYearAffRank', 'MatchMAGRetractionYearAffID',
       'MatchMAGRetractionYearAffYear', 'MatchMAGRetractionYearAffRank',
       'MatchMAGMaxRetractionYear', 'MAGrootFID', 'MAGrootFIDMaxPercent',
       'MatchMAGrootFID', 'MatchMAGrootFIDMaxPercent', 'GenderizeGender',
       'MAGFirstPubYear', 'MAGFirstAffID', 'MAGFirstAffiliationRank',
       'MatchMAGFirstAffID', 'MatchMAGFirstAffYear',
       'MatchMAGFirstAffiliationRank', 'MAGCumPapersAtRetraction',
       'MAGCumPapersYearAtRetraction', 'MatchMAGCumPapersYearAtRetraction',
       'MatchMAGCumPapersAtRetraction', 'MAGCumCitationsAtRetraction',
       'MAGCumCitationsYearAtRetraction',
       'MatchMAGCumCitationsYearAtRetraction',
       'MatchMAGCumCitationsAtRetraction', 'MAGCumCollaboratorsAtRetraction',
       'MAGCumCollaboratorsYearAtRetraction',
       'MatchMAGCumCollaborat

In [8]:
def merge_confounders(dfi):
    dfi_w_confounders = dfi.merge(df_confounders,on=['MAGAID','MatchMAGAID','Record ID'])
    assert(dfi_w_confounders.MAGAID.nunique() == dfi.MAGAID.nunique())
    assert(dfi_w_confounders.MatchMAGAID.nunique() == dfi.MatchMAGAID.nunique())
    return dfi_w_confounders.drop_duplicates()


dfmatched_10_perc_wConfounders = merge_confounders(dfmatched_10perc)
dfmatched_20_perc_wConfounders = merge_confounders(dfmatched_20perc)
dfmatched_30_perc_wConfounders = merge_confounders(dfmatched_30perc)



In [9]:
dfmatched_10_perc_wConfounders

Unnamed: 0,MAGAID,MatchMAGAID,Record ID,MAGPID,RetractionYear,MAGRetractionYearAffID,MAGRetractionYearAffYear,MAGRetractionYearAffRank,MatchMAGRetractionYearAffID,MatchMAGRetractionYearAffYear,...,MatchMAGCumPapersYearAtRetraction,MatchMAGCumPapersAtRetraction,MAGCumCitationsAtRetraction,MAGCumCitationsYearAtRetraction,MatchMAGCumCitationsYearAtRetraction,MatchMAGCumCitationsAtRetraction,MAGCumCollaboratorsAtRetraction,MAGCumCollaboratorsYearAtRetraction,MatchMAGCumCollaboratorsYearAtRetraction,MatchMAGCumCollaboratorsAtRetraction
0,8.197726e+06,2.107234e+09,3444,1.506358e+08,2012,13955877.0,2012.0,401-500,69737025.0,2012.0,...,2012.0,4,24.0,2012.0,2012.0,26.0,16.0,2012.0,2012.0,16.0
1,3.186698e+07,2.130498e+09,17638,2.000971e+09,2012,8087733.0,2012.0,151-200,205349734.0,2011.0,...,2012.0,4,21.0,2012.0,2012.0,19.0,4.0,2011.0,2012.0,4.0
2,4.355984e+07,2.075608e+09,10,2.079880e+09,2014,861853513.0,2013.0,151-200,861853513.0,2014.0,...,2014.0,31,324.0,2014.0,2014.0,296.0,96.0,2014.0,2014.0,105.0
3,4.757012e+07,2.063571e+09,1202,1.975776e+09,2015,126744593.0,2015.0,201-300,100532134.0,2015.0,...,2015.0,13,189.0,2015.0,2015.0,187.0,39.0,2015.0,2015.0,38.0
4,6.796688e+07,2.568572e+09,3344,2.035632e+09,2014,186903577.0,2013.0,701-800,874386039.0,2014.0,...,2014.0,21,309.0,2014.0,2014.0,328.0,30.0,2013.0,2014.0,28.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3244,3.171336e+09,2.642936e+09,1638,2.153561e+09,2013,154099455.0,2012.0,201-300,27357992.0,2012.0,...,2012.0,1,1.0,2013.0,2013.0,1.0,13.0,2012.0,2012.0,13.0
3245,3.171336e+09,2.642936e+09,1638,2.153561e+09,2013,154099455.0,2012.0,201-300,27357992.0,2012.0,...,2012.0,1,1.0,2013.0,2013.0,1.0,13.0,2012.0,2012.0,13.0
3246,3.171336e+09,2.642936e+09,1638,2.153561e+09,2013,154099455.0,2012.0,201-300,27357992.0,2012.0,...,2012.0,1,1.0,2013.0,2013.0,1.0,13.0,2012.0,2012.0,13.0
3247,3.171336e+09,2.642936e+09,1638,2.153561e+09,2013,154099455.0,2012.0,201-300,27357992.0,2012.0,...,2012.0,1,1.0,2013.0,2013.0,1.0,13.0,2012.0,2012.0,13.0


In [10]:
# Loading regression sample

df_regression_sample = pd.read_csv("../../data/h4_altmetric/regression/RW_Authors_forRegression_rematching.csv",
                                  usecols=['MAGAID','Record ID', 'DemiDecade',
                                          'ReasonPropagatedMajorityOfMajority',
                                          'RetractorMajority','MAGAIDRankTypeInRetractedPaper']).drop_duplicates().\
                                    rename(columns={'DemiDecade':'DemiDecadeOfRetraction',
                                                   'MAGAIDRankTypeInRetractedPaper':'MAGAIDFirstORLastAuthorFlag'}).\
                                    replace({'First or Last or Only Author':'MAGFirstOrLastAuthor',
                                            'Middle Author':'MAGMiddleAuthor'})
df_regression_sample

Unnamed: 0,Record ID,MAGAID,MAGAIDFirstORLastAuthorFlag,DemiDecadeOfRetraction,RetractorMajority,ReasonPropagatedMajorityOfMajority
0,3031,2.111744e+09,MAGFirstOrLastAuthor,1990-1995,author,mistake
1,3031,2.245003e+09,MAGFirstOrLastAuthor,1990-1995,author,mistake
2,1082,2.120727e+09,MAGMiddleAuthor,1990-1995,,mistake
3,1082,2.151686e+09,MAGFirstOrLastAuthor,1990-1995,,mistake
4,1082,2.552715e+09,MAGMiddleAuthor,1990-1995,,mistake
...,...,...,...,...,...,...
34708,8314,1.979824e+09,MAGFirstOrLastAuthor,2011-2015,,mistake
34710,2835,1.972149e+09,MAGFirstOrLastAuthor,2011-2015,,misconduct
34715,16836,2.650217e+09,MAGFirstOrLastAuthor,2011-2015,,other
34716,16836,2.690000e+09,MAGFirstOrLastAuthor,2011-2015,,other


In [11]:
def merge_strataVars(dfi):
    dfi_w_strataVars = dfi.merge(df_regression_sample,on=['MAGAID','Record ID'])
    assert(dfi_w_strataVars.MAGAID.nunique() == dfi.MAGAID.nunique())
    return dfi_w_strataVars.drop_duplicates()

dfmatched_10_perc_wStrataVars = merge_strataVars(dfmatched_10_perc_wConfounders)
dfmatched_20_perc_wStrataVars = merge_strataVars(dfmatched_20_perc_wConfounders)
dfmatched_30_perc_wStrataVars = merge_strataVars(dfmatched_30_perc_wConfounders)

dfmatched_30_perc_wStrataVars

Unnamed: 0,MAGAID,MatchMAGAID,Record ID,MAGPID,RetractionYear,MAGRetractionYearAffID,MAGRetractionYearAffYear,MAGRetractionYearAffRank,MatchMAGRetractionYearAffID,MatchMAGRetractionYearAffYear,...,MatchMAGCumCitationsYearAtRetraction,MatchMAGCumCitationsAtRetraction,MAGCumCollaboratorsAtRetraction,MAGCumCollaboratorsYearAtRetraction,MatchMAGCumCollaboratorsYearAtRetraction,MatchMAGCumCollaboratorsAtRetraction,MAGAIDFirstORLastAuthorFlag,DemiDecadeOfRetraction,RetractorMajority,ReasonPropagatedMajorityOfMajority
0,8.197726e+06,2.107234e+09,3444,1.506358e+08,2012,1.395588e+07,2012.0,401-500,6.973702e+07,2012.0,...,2012.0,26.0,16.0,2012.0,2012.0,16.0,MAGMiddleAuthor,2011-2015,author,mistake
1,1.373700e+07,2.311431e+09,3344,2.035632e+09,2014,1.869036e+08,2013.0,701-800,2.224887e+07,2011.0,...,2014.0,192.0,28.0,2013.0,2011.0,24.0,MAGMiddleAuthor,2011-2015,,mistake
2,1.551904e+07,2.955374e+09,898,2.053834e+09,2013,1.219343e+08,2013.0,151-200,1.673386e+07,2006.0,...,2013.0,28.0,11.0,2010.0,2012.0,13.0,MAGMiddleAuthor,2011-2015,journal,plagiarism
3,1.910029e+07,2.154843e+09,3740,2.011539e+09,2002,3.625896e+07,2002.0,18,1.806702e+08,1994.0,...,2002.0,123.0,29.0,2002.0,2002.0,37.0,MAGFirstOrLastAuthor,2001-2005,,mistake
4,2.127731e+07,7.210135e+08,21911,2.949847e+09,2011,1.294672e+09,2011.0,1001-,1.294672e+09,2009.0,...,2011.0,273.0,52.0,2011.0,2011.0,58.0,MAGFirstOrLastAuthor,2011-2015,author,mistake
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5840,3.173544e+09,2.893741e+09,9548,2.045218e+09,2011,1.937760e+08,2011.0,201-300,3.942224e+07,2011.0,...,2011.0,57.0,38.0,2011.0,2011.0,38.0,MAGFirstOrLastAuthor,2011-2015,journal,misconduct
5841,3.174427e+09,2.151144e+09,2745,1.979978e+09,2005,1.845657e+08,2003.0,301-400,1.695220e+08,2005.0,...,2005.0,178.0,39.0,2005.0,2005.0,42.0,MAGFirstOrLastAuthor,2001-2005,,mistake
5842,3.174448e+09,1.897294e+09,23452,2.166758e+09,2008,9.906509e+07,2008.0,43,8.651931e+07,2007.0,...,2006.0,1.0,6.0,2008.0,2007.0,6.0,MAGFirstOrLastAuthor,2006-2010,other,other
5843,3.174844e+09,2.315520e+09,17239,2.102017e+09,2014,8.659153e+08,2013.0,101-150,8.870644e+08,2014.0,...,2014.0,18.0,11.0,2013.0,2014.0,10.0,MAGMiddleAuthor,2011-2015,,misconduct


In [12]:
dfmatched_30_perc_wStrataVars.ReasonPropagatedMajorityOfMajority.value_counts()

misconduct    1617
plagiarism    1591
mistake       1341
other         1296
Name: ReasonPropagatedMajorityOfMajority, dtype: int64

### Academic age

In [13]:
def compute_activityBin(row):
    if(row.AcademicAgeBeforeRetraction <= 1):
        return "1"
    elif(row.AcademicAgeBeforeRetraction <= 2):
        return "2"
    elif(row.AcademicAgeBeforeRetraction <= 5):
        return "3-5"
    else:
        return ">5"

def augment_age(dfi):
    dfi['AcademicAgeBeforeRetraction'] = dfi['RetractionYear'] - dfi['MAGFirstPubYear']
    dfi['AcademicAgeBin'] = dfi.apply(lambda row: compute_activityBin(row), axis=1)
    return dfi

dfmatched_10_perc_wStrataVars = augment_age(dfmatched_10_perc_wStrataVars)
dfmatched_20_perc_wStrataVars = augment_age(dfmatched_20_perc_wStrataVars)
dfmatched_30_perc_wStrataVars = augment_age(dfmatched_30_perc_wStrataVars)


dfmatched_10_perc_wStrataVars

Unnamed: 0,MAGAID,MatchMAGAID,Record ID,MAGPID,RetractionYear,MAGRetractionYearAffID,MAGRetractionYearAffYear,MAGRetractionYearAffRank,MatchMAGRetractionYearAffID,MatchMAGRetractionYearAffYear,...,MAGCumCollaboratorsAtRetraction,MAGCumCollaboratorsYearAtRetraction,MatchMAGCumCollaboratorsYearAtRetraction,MatchMAGCumCollaboratorsAtRetraction,MAGAIDFirstORLastAuthorFlag,DemiDecadeOfRetraction,RetractorMajority,ReasonPropagatedMajorityOfMajority,AcademicAgeBeforeRetraction,AcademicAgeBin
0,8.197726e+06,2.107234e+09,3444,1.506358e+08,2012,13955877.0,2012.0,401-500,69737025.0,2012.0,...,16.0,2012.0,2012.0,16.0,MAGMiddleAuthor,2011-2015,author,mistake,5.0,3-5
1,3.186698e+07,2.130498e+09,17638,2.000971e+09,2012,8087733.0,2012.0,151-200,205349734.0,2011.0,...,4.0,2011.0,2012.0,4.0,MAGMiddleAuthor,2011-2015,journal,other,1.0,1
2,4.355984e+07,2.075608e+09,10,2.079880e+09,2014,861853513.0,2013.0,151-200,861853513.0,2014.0,...,96.0,2014.0,2014.0,105.0,MAGMiddleAuthor,2011-2015,journal,plagiarism,11.0,>5
3,4.757012e+07,2.063571e+09,1202,1.975776e+09,2015,126744593.0,2015.0,201-300,100532134.0,2015.0,...,39.0,2015.0,2015.0,38.0,MAGMiddleAuthor,2011-2015,author,misconduct,5.0,3-5
4,6.796688e+07,2.568572e+09,3344,2.035632e+09,2014,186903577.0,2013.0,701-800,874386039.0,2014.0,...,30.0,2013.0,2014.0,28.0,MAGFirstOrLastAuthor,2011-2015,,mistake,18.0,>5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3244,3.171336e+09,2.642936e+09,1638,2.153561e+09,2013,154099455.0,2012.0,201-300,27357992.0,2012.0,...,13.0,2012.0,2012.0,13.0,MAGMiddleAuthor,2011-2015,author,mistake,1.0,1
3245,3.171336e+09,2.642936e+09,1638,2.153561e+09,2013,154099455.0,2012.0,201-300,27357992.0,2012.0,...,13.0,2012.0,2012.0,13.0,MAGMiddleAuthor,2011-2015,author,mistake,1.0,1
3246,3.171336e+09,2.642936e+09,1638,2.153561e+09,2013,154099455.0,2012.0,201-300,27357992.0,2012.0,...,13.0,2012.0,2012.0,13.0,MAGMiddleAuthor,2011-2015,author,mistake,1.0,1
3247,3.171336e+09,2.642936e+09,1638,2.153561e+09,2013,154099455.0,2012.0,201-300,27357992.0,2012.0,...,13.0,2012.0,2012.0,13.0,MAGMiddleAuthor,2011-2015,author,mistake,1.0,1


## Augmenting Field Name

In [14]:
df_fieldnames = pd.read_csv("../../data/main/RootFieldsNames.txt").\
                    rename(columns={'root_FieldID':'MAGrootFID',
                                   'FieldName': 'MAGFieldName'})

dfmatched_10_perc_wStrataVars = dfmatched_10_perc_wStrataVars.merge(df_fieldnames, 
                                                                   on='MAGrootFID')

dfmatched_20_perc_wStrataVars = dfmatched_20_perc_wStrataVars.merge(df_fieldnames, 
                                                                   on='MAGrootFID')

dfmatched_30_perc_wStrataVars = dfmatched_30_perc_wStrataVars.merge(df_fieldnames, 
                                                                   on='MAGrootFID')

## Mapping Field Names to STEM and non-STEM

In [15]:
# Classifying fields with < 5% as other stem and non-stem
other_stem_fields = ['materials science', 'computer science',
                'engineering', 'mathematics', 'psychology',
                'economics', 'environmental science']

non_stem_fields = ['psychology','political science', 'geology',
                  'philosophy','geography','sociology','business',
                  'history','art']

dfmatched_10_perc_wStrataVars['MAGFieldName'] = dfmatched_10_perc_wStrataVars['MAGFieldName']\
                                                   .replace(dict.fromkeys(other_stem_fields, 'other STEM fields'))\
                                                    .replace(dict.fromkeys(non_stem_fields,'non-STEM fields'))

dfmatched_20_perc_wStrataVars['MAGFieldName'] = dfmatched_20_perc_wStrataVars['MAGFieldName']\
                                                   .replace(dict.fromkeys(other_stem_fields, 'other STEM fields'))\
                                                    .replace(dict.fromkeys(non_stem_fields,'non-STEM fields'))

dfmatched_30_perc_wStrataVars['MAGFieldName'] = dfmatched_30_perc_wStrataVars['MAGFieldName']\
                                                   .replace(dict.fromkeys(other_stem_fields, 'other STEM fields'))\
                                                    .replace(dict.fromkeys(non_stem_fields,'non-STEM fields'))


## Mapping Retractor Majority NaNs to Other

In [16]:
def impute_retractor_majority_NaNs(dfj):
    dfj['RetractorMajority'] = dfj['RetractorMajority'].fillna('other retractor')
    dfj['RetractorMajority'] = dfj['RetractorMajority'].replace({'other':'other retractor'})
    return dfj
    
    
dfmatched_10_perc_wStrataVars = impute_retractor_majority_NaNs(dfmatched_10_perc_wStrataVars)
dfmatched_20_perc_wStrataVars = impute_retractor_majority_NaNs(dfmatched_20_perc_wStrataVars)
dfmatched_30_perc_wStrataVars = impute_retractor_majority_NaNs(dfmatched_30_perc_wStrataVars)

## Mapping Affiliation Ranks

In [17]:
def map_affiliation_ranks(dfj, col):
    
    mapping = {'101-150':'101-500',
              '151-200':'101-500',
              '201-300':'101-500',
              '301-400':'101-500',
              '401-500':'101-500',
              '501-600':'501-1000',
              '601-700':'501-1000',
              '701-800':'501-1000',
              '801-900':'501-1000',
              '901-1000':'501-1000',
              '1001-':'>1000',}
    
    dfj[col+'Stratified'] = dfj[col].map(mapping).fillna('1-100')
    
    return dfj


dfmatched_10_perc_wStrataVars = map_affiliation_ranks(dfmatched_10_perc_wStrataVars, 'MAGRetractionYearAffRank')
dfmatched_20_perc_wStrataVars = map_affiliation_ranks(dfmatched_20_perc_wStrataVars, 'MAGRetractionYearAffRank')
dfmatched_30_perc_wStrataVars = map_affiliation_ranks(dfmatched_30_perc_wStrataVars, 'MAGRetractionYearAffRank')
dfmatched_30_perc_wStrataVars

Unnamed: 0,MAGAID,MatchMAGAID,Record ID,MAGPID,RetractionYear,MAGRetractionYearAffID,MAGRetractionYearAffYear,MAGRetractionYearAffRank,MatchMAGRetractionYearAffID,MatchMAGRetractionYearAffYear,...,MatchMAGCumCollaboratorsYearAtRetraction,MatchMAGCumCollaboratorsAtRetraction,MAGAIDFirstORLastAuthorFlag,DemiDecadeOfRetraction,RetractorMajority,ReasonPropagatedMajorityOfMajority,AcademicAgeBeforeRetraction,AcademicAgeBin,MAGFieldName,MAGRetractionYearAffRankStratified
0,8.197726e+06,2.107234e+09,3444,1.506358e+08,2012,1.395588e+07,2012.0,401-500,6.973702e+07,2012.0,...,2012.0,16.0,MAGMiddleAuthor,2011-2015,author,mistake,5.0,3-5,chemistry,101-500
1,2.127731e+07,7.210135e+08,21911,2.949847e+09,2011,1.294672e+09,2011.0,1001-,1.294672e+09,2009.0,...,2011.0,58.0,MAGFirstOrLastAuthor,2011-2015,author,mistake,16.0,>5,chemistry,>1000
2,3.343381e+07,2.135651e+09,23728,2.055923e+09,2009,1.733049e+08,2009.0,201-300,1.217483e+08,2009.0,...,2009.0,8.0,MAGFirstOrLastAuthor,2006-2010,author,mistake,2.0,2,chemistry,101-500
3,6.373615e+07,2.486812e+09,4317,2.166194e+09,2009,1.733049e+08,2009.0,201-300,1.687197e+08,2009.0,...,2009.0,11.0,MAGMiddleAuthor,2006-2010,other retractor,mistake,2.0,2,chemistry,101-500
4,7.092528e+07,2.949159e+09,8100,2.068127e+09,2014,1.582483e+08,2014.0,501-600,2.007630e+08,2014.0,...,2014.0,19.0,MAGFirstOrLastAuthor,2011-2015,other retractor,plagiarism,4.0,3-5,chemistry,501-1000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5840,2.714016e+09,2.310975e+09,7423,2.147214e+09,2008,1.297744e+08,2008.0,151-200,1.082905e+08,2008.0,...,2008.0,4.0,MAGMiddleAuthor,2006-2010,other retractor,plagiarism,0.0,1,non-STEM fields,101-500
5841,2.993453e+09,1.985467e+09,4714,2.192461e+09,2012,2.394603e+07,2011.0,301-400,4.432739e+06,2011.0,...,2011.0,3.0,MAGFirstOrLastAuthor,2011-2015,journal,plagiarism,1.0,1,non-STEM fields,101-500
5842,2.993453e+09,2.169270e+09,4714,2.192461e+09,2012,2.394603e+07,2011.0,301-400,2.986251e+08,2011.0,...,2011.0,3.0,MAGFirstOrLastAuthor,2011-2015,journal,plagiarism,1.0,1,non-STEM fields,101-500
5843,2.993453e+09,3.124413e+09,4714,2.192461e+09,2012,2.394603e+07,2011.0,301-400,1.895907e+08,2011.0,...,2011.0,3.0,MAGFirstOrLastAuthor,2011-2015,journal,plagiarism,1.0,1,non-STEM fields,101-500


In [18]:
dfmatched_30_perc_wStrataVars[['MAGRetractionYearAffRank','MAGRetractionYearAffRankStratified']].head(30)

Unnamed: 0,MAGRetractionYearAffRank,MAGRetractionYearAffRankStratified
0,401-500,101-500
1,1001-,>1000
2,201-300,101-500
3,201-300,101-500
4,501-600,501-1000
5,801-900,501-1000
6,201-300,101-500
7,101-150,101-500
8,101-150,101-500
9,14,1-100


## Augmenting outcome variables

In [19]:
dfmatched_10_perc_wStrataVars.columns

Index(['MAGAID', 'MatchMAGAID', 'Record ID', 'MAGPID', 'RetractionYear',
       'MAGRetractionYearAffID', 'MAGRetractionYearAffYear',
       'MAGRetractionYearAffRank', 'MatchMAGRetractionYearAffID',
       'MatchMAGRetractionYearAffYear', 'MatchMAGRetractionYearAffRank',
       'MatchMAGMaxRetractionYear', 'MAGrootFID', 'MAGrootFIDMaxPercent',
       'MatchMAGrootFID', 'MatchMAGrootFIDMaxPercent', 'GenderizeGender',
       'MAGFirstPubYear', 'MAGFirstAffID', 'MAGFirstAffiliationRank',
       'MatchMAGFirstAffID', 'MatchMAGFirstAffYear',
       'MatchMAGFirstAffiliationRank', 'MAGCumPapersAtRetraction',
       'MAGCumPapersYearAtRetraction', 'MatchMAGCumPapersYearAtRetraction',
       'MatchMAGCumPapersAtRetraction', 'MAGCumCitationsAtRetraction',
       'MAGCumCitationsYearAtRetraction',
       'MatchMAGCumCitationsYearAtRetraction',
       'MatchMAGCumCitationsAtRetraction', 'MAGCumCollaboratorsAtRetraction',
       'MAGCumCollaboratorsYearAtRetraction',
       'MatchMAGCumCollaborat

In [20]:
# Creating treatment and control

df_treatment = dfmatched_30_perc_wStrataVars.drop(columns=['MatchMAGAID','MatchMAGRetractionYearAffID',
               'MatchMAGRetractionYearAffYear', 'MatchMAGRetractionYearAffRank',
               'MatchMAGMaxRetractionYear','MatchMAGrootFID', 'MatchMAGrootFIDMaxPercent',
                'MatchMAGCumPapersYearAtRetraction', 'MatchMAGCumPapersAtRetraction',
               'MatchMAGCumCitationsYearAtRetraction', 'MatchMAGCumCitationsAtRetraction',
               'MatchMAGCumCollaboratorsYearAtRetraction',
               'MatchMAGCumCollaboratorsAtRetraction',
                'MatchMAGFirstAffID', 'MatchMAGFirstAffYear',
               'MatchMAGFirstAffiliationRank']).drop_duplicates()

df_control = dfmatched_30_perc_wStrataVars.copy()

In [21]:
df_treatment.columns

Index(['MAGAID', 'Record ID', 'MAGPID', 'RetractionYear',
       'MAGRetractionYearAffID', 'MAGRetractionYearAffYear',
       'MAGRetractionYearAffRank', 'MAGrootFID', 'MAGrootFIDMaxPercent',
       'GenderizeGender', 'MAGFirstPubYear', 'MAGFirstAffID',
       'MAGFirstAffiliationRank', 'MAGCumPapersAtRetraction',
       'MAGCumPapersYearAtRetraction', 'MAGCumCitationsAtRetraction',
       'MAGCumCitationsYearAtRetraction', 'MAGCumCollaboratorsAtRetraction',
       'MAGCumCollaboratorsYearAtRetraction', 'MAGAIDFirstORLastAuthorFlag',
       'DemiDecadeOfRetraction', 'RetractorMajority',
       'ReasonPropagatedMajorityOfMajority', 'AcademicAgeBeforeRetraction',
       'AcademicAgeBin', 'MAGFieldName', 'MAGRetractionYearAffRankStratified'],
      dtype='object')

In [23]:
# Reading the collaborators file
df_1d_collaborators = pd.read_csv("../../data/main/RW_MAGcollaborators_1stDegree_rematching_woPapersCitationsCollaboratorsAtRetraction_wCollabYear_le2020_closestMatch30_nonattrited_matches.csv")
df_1d_collaborators

Unnamed: 0,MAGAID,ScientistType,MAGCollaborationYear,MAGCollabAID
0,2.119675e+09,retracted,1994.0,1970976640
1,2.119675e+09,retracted,1994.0,2138409761
2,2.119675e+09,retracted,1994.0,2806145691
3,2.119675e+09,retracted,1994.0,2891431513
4,2.119675e+09,retracted,1966.0,361927458
...,...,...,...,...
1255787,2.060967e+09,matched,2019.0,2943795428
1255788,2.060967e+09,matched,2020.0,2616288379
1255789,2.060967e+09,matched,2020.0,2951185787
1255790,2.060967e+09,matched,2020.0,2121088250


In [27]:
df_1d_collaborators[df_1d_collaborators.MAGCollabAID.isin(df_1d_collaborators.MAGAID.unique())]

df_1d_collaborators[df_1d_collaborators.MAGCollabAID.eq(2410335969)]

Unnamed: 0,MAGAID,ScientistType,MAGCollaborationYear,MAGCollabAID
14,2119675000.0,retracted,1998.0,2410335969
19,2119675000.0,retracted,1990.0,2410335969
34,2119675000.0,retracted,1988.0,2410335969
44,2119675000.0,retracted,1991.0,2410335969
147,2119675000.0,retracted,1997.0,2410335969
152,2119675000.0,retracted,1996.0,2410335969
195,2119675000.0,retracted,1994.0,2410335969


In [635]:
# Separating collaborators for treatment and control
df_1d_collaborators_treatment = df_1d_collaborators[df_1d_collaborators.ScientistType == 'retracted'].\
                                drop(columns=['ScientistType']).\
                                drop_duplicates() # Not necessary but still

df_1d_collaborators_control = df_1d_collaborators[df_1d_collaborators.ScientistType == 'matched'].\
                                rename(columns={'MAGAID':'MatchMAGAID'}).\
                                drop(columns=['ScientistType']).\
                                drop_duplicates() # Not necessary but still


# Let us only get collaborators for MAGAIDs that are relevant
df_1d_collaborators_treatment = df_1d_collaborators_treatment.\
                                merge(df_treatment[['MAGAID','RetractionYear']].drop_duplicates(),
                                on='MAGAID', how='right')

# Now let us augment df_1d_collaborators_control with MAGAID first
# Also only getting collaborators for matches that are useful
df_1d_collaborators_control = df_1d_collaborators_control.\
                                merge(df_control[['MAGAID','MatchMAGAID','RetractionYear']].drop_duplicates(),
                                on=['MatchMAGAID'], how='right')

df_1d_collaborators_control.shape

(583120, 5)

In [636]:
df_1d_collaborators_treatment[df_1d_collaborators_treatment.MAGCollaborationYear.isna()]

Unnamed: 0,MAGAID,MAGCollaborationYear,MAGCollabAID,RetractionYear


In [637]:
df_1d_collaborators_treatment.MAGAID.nunique(),df_1d_collaborators_control.MAGAID.nunique(),df_1d_collaborators_control.MatchMAGAID.nunique()


(3094, 3094, 4274)

### Extracting pre- and post-retraction collaborators with 5 year window

Given the assumption that retraction affects the scientists' reputation for only certain number of years, after which there is a phase of recovery, we conduct our analysis by limiting collaborations to a 5 year window such that we only look at collaborators 5 year in the past and 5 years in the future. 

**VERY important note: earlier we may be dropping authors with no collaborators pre and post. We must add them back**

In [638]:
#Let us first create a prepost flag to check if a collaborator is before or after retraction given 5 year window

def get_prepost_flag(row):
    if(pd.isna(row['MAGCollaborationYear'])):
        return 'pre'
    if(row['MAGCollaborationYear'] <= row['RetractionYear']):
        return 'pre'
    else:
        if((row['MAGCollaborationYear']-row['RetractionYear'])<=5):
            return 'post5'
        else:
            return 'post'


# Now let us apply the get_prepost_flag function to each row for treatment and control

df_1d_collaborators_treatment['PrePostFlag5'] = df_1d_collaborators_treatment.apply(lambda row: get_prepost_flag(row), 
                                                 axis=1)

df_1d_collaborators_control['PrePostFlag5'] = df_1d_collaborators_control.apply(lambda row: get_prepost_flag(row), 
                                             axis=1)

In [639]:
df_1d_collaborators_control[df_1d_collaborators_control.PrePostFlag5.isna()]


Unnamed: 0,MatchMAGAID,MAGCollaborationYear,MAGCollabAID,MAGAID,RetractionYear,PrePostFlag5


In [640]:
# Now we must first impute NaNs with 

In [641]:
# Now let us extract pre- and post-retraction collaborators as set

# Grouping by MAGAID, gender, and pre-post flag to extract pre, and post- re. collabs.
df_1d_collaborators_treatment_w5 = df_1d_collaborators_treatment.groupby(['MAGAID',
                                          'RetractionYear','PrePostFlag5'])\
                                        ['MAGCollabAID'].apply(set).unstack().reset_index().\
                                        drop(columns=['post'])

df_1d_collaborators_control_w5 = df_1d_collaborators_control.groupby(['MAGAID','MatchMAGAID','RetractionYear',
                                                                   'PrePostFlag5'])\
                                        ['MAGCollabAID'].apply(set).unstack().reset_index().\
                                        drop(columns=['post'])


In [642]:
# df_1d_collaborators_control_w5[df_1d_collaborators_control_w5.post5.isna() & 
#                               df_1d_collaborators_control_w5.MatchMAGAID.isin(temp)]

In [643]:
df_1d_collaborators_control_w5

PrePostFlag5,MAGAID,MatchMAGAID,RetractionYear,post5,pre
0,8.197726e+06,2.107234e+09,2012,"{2145188037, 2306846982, 2158779783, 248809216...","{2690241248, 2344056129, 2130291684, 214518803..."
1,1.373700e+07,2.311431e+09,2014,"{2026366049, 2103074534, 160984200, 1973807465...","{2675909761, 2054385801, 2109105035, 212543528..."
2,1.551904e+07,2.955374e+09,2013,"{2642045704, 2693253939, 2104920851, 2704746709}","{2693266496, 2682261761, 3079794078, 262063520..."
3,1.910029e+07,2.154843e+09,2002,"{2320217344, 2139722369, 2009431808, 271734412...","{2107031301, 2130157064, 2698065035, 246342132..."
4,2.127731e+07,7.210135e+08,2011,"{2551369984, 2031444739, 2616952581, 225338983...","{2504012547, 2304281604, 2664301957, 225338983..."
...,...,...,...,...,...
4559,3.173544e+09,2.893741e+09,2011,"{1968556036, 2097680395, 2047703565, 221441282...","{2112561540, 2308285060, 2793024262, 205354816..."
4560,3.174427e+09,2.151144e+09,2005,"{2495799300, 2281580422, 2183320844, 201263092...","{2223475336, 2499355405, 2012630926, 214343438..."
4561,3.174448e+09,1.897294e+09,2008,{2074793029},"{2594844899, 2074793029, 2067866536, 266878635..."
4562,3.174844e+09,2.315520e+09,2014,"{2310608896, 2991788804, 2948262277, 205797632...","{2674099783, 2164723530, 621341675, 2297775246..."


In [644]:

# Dealing with NaNs, and replacing them with empty set

# For treatment
df_1d_collaborators_treatment_w5['pre'] = df_1d_collaborators_treatment_w5['pre'].\
                                            apply(lambda d: d if isinstance(d, set) else set())

df_1d_collaborators_treatment_w5['post5'] = df_1d_collaborators_treatment_w5['post5'].\
                                                apply(lambda d: d if isinstance(d, set) else set())

# For control
df_1d_collaborators_control_w5['pre'] = df_1d_collaborators_control_w5['pre'].\
                                            apply(lambda d: d if isinstance(d, set) else set())
df_1d_collaborators_control_w5['post5'] = df_1d_collaborators_control_w5['post5'].\
                                            apply(lambda d: d if isinstance(d, set) else set())

In [645]:
df_1d_collaborators_control_w5

PrePostFlag5,MAGAID,MatchMAGAID,RetractionYear,post5,pre
0,8.197726e+06,2.107234e+09,2012,"{2145188037, 2306846982, 2158779783, 248809216...","{2690241248, 2344056129, 2130291684, 214518803..."
1,1.373700e+07,2.311431e+09,2014,"{2026366049, 2103074534, 160984200, 1973807465...","{2675909761, 2054385801, 2109105035, 212543528..."
2,1.551904e+07,2.955374e+09,2013,"{2642045704, 2693253939, 2104920851, 2704746709}","{2693266496, 2682261761, 3079794078, 262063520..."
3,1.910029e+07,2.154843e+09,2002,"{2320217344, 2139722369, 2009431808, 271734412...","{2107031301, 2130157064, 2698065035, 246342132..."
4,2.127731e+07,7.210135e+08,2011,"{2551369984, 2031444739, 2616952581, 225338983...","{2504012547, 2304281604, 2664301957, 225338983..."
...,...,...,...,...,...
4559,3.173544e+09,2.893741e+09,2011,"{1968556036, 2097680395, 2047703565, 221441282...","{2112561540, 2308285060, 2793024262, 205354816..."
4560,3.174427e+09,2.151144e+09,2005,"{2495799300, 2281580422, 2183320844, 201263092...","{2223475336, 2499355405, 2012630926, 214343438..."
4561,3.174448e+09,1.897294e+09,2008,{2074793029},"{2594844899, 2074793029, 2067866536, 266878635..."
4562,3.174844e+09,2.315520e+09,2014,"{2310608896, 2991788804, 2948262277, 205797632...","{2674099783, 2164723530, 621341675, 2297775246..."


### Extracting number & set of retained collaborators with a 5 year window

In [646]:
# Now let us find the number and set of retained collaborators for both the groups

df_1d_collaborators_treatment_w5['NumRetentionW5'] = df_1d_collaborators_treatment_w5.apply(lambda row: \
                                                    len(row.post5.intersection(row.pre)), 
                                                    axis=1)

df_1d_collaborators_treatment_w5['CollabAIDRetainedW5'] = df_1d_collaborators_treatment_w5.apply(lambda row: \
                                                    row.post5.intersection(row.pre), 
                                                    axis=1)

df_1d_collaborators_control_w5['NumRetentionW5'] = df_1d_collaborators_control_w5.apply(lambda row: \
                                                len(row.post5.intersection(row.pre)), 
                                                axis=1)

df_1d_collaborators_control_w5['CollabAIDRetainedW5'] = df_1d_collaborators_control_w5.apply(lambda row: \
                                                    row.post5.intersection(row.pre), 
                                                    axis=1)

In [647]:
df_1d_collaborators_treatment_w5

PrePostFlag5,MAGAID,RetractionYear,post5,pre,NumRetentionW5,CollabAIDRetainedW5
0,8.197726e+06,2012,"{1969204096, 2226225926, 2024673639, 199363863...","{2024673639, 279977545, 1970425162, 1689175372...",4,"{1970425162, 1689175372, 2122225613, 2024673639}"
1,1.373700e+07,2014,"{2144558976, 2068504066, 1986642243, 146813190...","{2144558976, 1456934145, 2050002051, 146813190...",8,"{2144558976, 1986642243, 146813190, 2304798023..."
2,1.551904e+07,2013,"{2716086216, 2104789297, 67380540, 2546151959}","{2648287456, 2318811809, 2332026499, 242479555...",0,{}
3,1.910029e+07,2002,"{18011520, 2181345027, 2608247304, 2043645593,...","{2437904394, 2171993227, 2043645593, 220887567...",5,"{2128982626, 1863203661, 2043645593, 410625722..."
4,2.127731e+07,2011,"{2172958848, 1966109318, 1809354124, 302322228...","{1966109318, 2097219847, 2075459985, 11373586,...",6,"{1321577409, 163718210, 1941207651, 1966109318..."
...,...,...,...,...,...,...
3089,3.173544e+09,2011,"{3120802076, 2236143686}","{2096029443, 2160112133, 2158121610, 230907623...",0,{}
3090,3.174427e+09,2005,"{2107455647, 70839488, 3152002981, 2165407721,...","{2713650945, 2444308994, 2125822852, 271476378...",5,"{3152002981, 2309432301, 2109867797, 277857144..."
3091,3.174448e+09,2008,{2561941943},"{2413204075, 2100828844, 3175667245, 299211752...",1,{2561941943}
3092,3.174844e+09,2014,"{2636262617, 550125002, 2295148299, 2265510894...","{2954067065, 2311908582, 2005715177, 550125002...",3,"{2174600848, 2636262617, 550125002}"


In [648]:
df_1d_collaborators_control_w5

PrePostFlag5,MAGAID,MatchMAGAID,RetractionYear,post5,pre,NumRetentionW5,CollabAIDRetainedW5
0,8.197726e+06,2.107234e+09,2012,"{2145188037, 2306846982, 2158779783, 248809216...","{2690241248, 2344056129, 2130291684, 214518803...",2,"{2145188037, 93643413}"
1,1.373700e+07,2.311431e+09,2014,"{2026366049, 2103074534, 160984200, 1973807465...","{2675909761, 2054385801, 2109105035, 212543528...",1,{152283727}
2,1.551904e+07,2.955374e+09,2013,"{2642045704, 2693253939, 2104920851, 2704746709}","{2693266496, 2682261761, 3079794078, 262063520...",0,{}
3,1.910029e+07,2.154843e+09,2002,"{2320217344, 2139722369, 2009431808, 271734412...","{2107031301, 2130157064, 2698065035, 246342132...",13,"{2318236832, 2107031301, 2130157064, 230507415..."
4,2.127731e+07,7.210135e+08,2011,"{2551369984, 2031444739, 2616952581, 225338983...","{2504012547, 2304281604, 2664301957, 225338983...",11,"{2461702755, 2135576292, 2253389830, 258145018..."
...,...,...,...,...,...,...,...
4559,3.173544e+09,2.893741e+09,2011,"{1968556036, 2097680395, 2047703565, 221441282...","{2112561540, 2308285060, 2793024262, 205354816...",16,"{1306850400, 2237788834, 1570332770, 230828506..."
4560,3.174427e+09,2.151144e+09,2005,"{2495799300, 2281580422, 2183320844, 201263092...","{2223475336, 2499355405, 2012630926, 214343438...",17,"{2233600099, 2511887269, 2530793961, 220429486..."
4561,3.174448e+09,1.897294e+09,2008,{2074793029},"{2594844899, 2074793029, 2067866536, 266878635...",1,{2074793029}
4562,3.174844e+09,2.315520e+09,2014,"{2310608896, 2991788804, 2948262277, 205797632...","{2674099783, 2164723530, 621341675, 2297775246...",6,"{2674099783, 2297775246, 2133515217, 279778914..."


In [649]:
# df_1d_collaborators_control_w5[df_1d_collaborators_control_w5.MatchMAGAID.isin(temp)].NumRetentionW5.value_counts()

### Extracting number & set of new collaborators with a 5 year window

In [650]:
# Now let us compute the number of new collaborators

# We can compute them by subtracting pre-retraction collaborators' set from post-retraction collaborators' set
def extract_num_newCollab(row):
    return len(row['post5']-row['pre'])

def extract_newCollab(row):
    return row['post5']-row['pre']

# computing number and set of new collaborators
df_1d_collaborators_treatment_w5['NumNewCollaboratorsW5'] = df_1d_collaborators_treatment_w5\
                                                            .apply(lambda row: extract_num_newCollab(row), 
                                                               axis=1)

df_1d_collaborators_treatment_w5['CollabAIDGainedW5'] = df_1d_collaborators_treatment_w5\
                                                            .apply(lambda row: extract_newCollab(row), 
                                                               axis=1)

df_1d_collaborators_control_w5['NumNewCollaboratorsW5'] = df_1d_collaborators_control_w5.apply(lambda row: extract_num_newCollab(row), 
                                                       axis=1)

df_1d_collaborators_control_w5['CollabAIDGainedW5'] = df_1d_collaborators_control_w5.apply(lambda row: extract_newCollab(row), 
                                                           axis=1)



In [651]:
df_1d_collaborators_treatment_w5.head(2)

PrePostFlag5,MAGAID,RetractionYear,post5,pre,NumRetentionW5,CollabAIDRetainedW5,NumNewCollaboratorsW5,CollabAIDGainedW5
0,8197726.0,2012,"{1969204096, 2226225926, 2024673639, 199363863...","{2024673639, 279977545, 1970425162, 1689175372...",4,"{1970425162, 1689175372, 2122225613, 2024673639}",10,"{1969204096, 2226225926, 1993638631, 256742932..."
1,13737004.0,2014,"{2144558976, 2068504066, 1986642243, 146813190...","{2144558976, 1456934145, 2050002051, 146813190...",8,"{2144558976, 1986642243, 146813190, 2304798023...",8,"{2068504066, 1526526887, 1529129866, 207583934..."


In [652]:
df_1d_collaborators_control_w5.head(2)

PrePostFlag5,MAGAID,MatchMAGAID,RetractionYear,post5,pre,NumRetentionW5,CollabAIDRetainedW5,NumNewCollaboratorsW5,CollabAIDGainedW5
0,8197726.0,2107234000.0,2012,"{2145188037, 2306846982, 2158779783, 248809216...","{2690241248, 2344056129, 2130291684, 214518803...",2,"{2145188037, 93643413}",9,"{2306846982, 2158779783, 2488092168, 269472638..."
1,13737004.0,2311431000.0,2014,"{2026366049, 2103074534, 160984200, 1973807465...","{2675909761, 2054385801, 2109105035, 212543528...",1,{152283727},12,"{2026366049, 2103074534, 160984200, 2502500713..."


### Merging num retention and num new collaborators with treatment and control

In [653]:

df_treatment_augmented = df_treatment.\
                merge(df_1d_collaborators_treatment_w5[['MAGAID','NumRetentionW5',
                                                        'NumNewCollaboratorsW5']].drop_duplicates(),
                     on='MAGAID')

df_treatment_augmented

Unnamed: 0,MAGAID,Record ID,MAGPID,RetractionYear,MAGRetractionYearAffID,MAGRetractionYearAffYear,MAGRetractionYearAffRank,MAGrootFID,MAGrootFIDMaxPercent,GenderizeGender,...,MAGAIDFirstORLastAuthorFlag,DemiDecadeOfRetraction,RetractorMajority,ReasonPropagatedMajorityOfMajority,AcademicAgeBeforeRetraction,AcademicAgeBin,MAGFieldName,MAGRetractionYearAffRankStratified,NumRetentionW5,NumNewCollaboratorsW5
0,8.197726e+06,3444,1.506358e+08,2012,1.395588e+07,2012.0,401-500,185592680.0,0.333333,female,...,MAGMiddleAuthor,2011-2015,author,mistake,5.0,3-5,chemistry,101-500,4,10
1,2.127731e+07,21911,2.949847e+09,2011,1.294672e+09,2011.0,1001-,185592680.0,0.388889,male,...,MAGFirstOrLastAuthor,2011-2015,author,mistake,16.0,>5,chemistry,>1000,6,14
2,3.343381e+07,23728,2.055923e+09,2009,1.733049e+08,2009.0,201-300,185592680.0,0.600000,male,...,MAGFirstOrLastAuthor,2006-2010,author,mistake,2.0,2,chemistry,101-500,4,0
3,6.373615e+07,4317,2.166194e+09,2009,1.733049e+08,2009.0,201-300,185592680.0,0.194444,female,...,MAGMiddleAuthor,2006-2010,other retractor,mistake,2.0,2,chemistry,101-500,7,29
4,7.092528e+07,8100,2.068127e+09,2014,1.582483e+08,2014.0,501-600,185592680.0,0.196970,male,...,MAGFirstOrLastAuthor,2011-2015,other retractor,plagiarism,4.0,3-5,chemistry,501-1000,4,12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3677,2.159512e+09,1987,2.059315e+09,2015,9.014989e+07,2015.0,1001-,127313418.0,0.163265,male,...,MAGMiddleAuthor,2011-2015,other retractor,plagiarism,5.0,3-5,non-STEM fields,>1000,9,12
3678,2.160882e+09,16554,2.014362e+09,2014,9.360294e+06,2014.0,201-300,127313418.0,0.333333,male,...,MAGMiddleAuthor,2011-2015,other retractor,misconduct,6.0,>5,non-STEM fields,101-500,8,12
3679,2.162518e+09,8543,2.121510e+09,2013,6.690620e+07,2012.0,901-1000,127313418.0,0.230769,female,...,MAGMiddleAuthor,2011-2015,other retractor,other,3.0,3-5,non-STEM fields,501-1000,5,13
3680,2.195114e+09,18557,2.045810e+09,2008,2.986251e+08,2008.0,301-400,127313418.0,0.285714,male,...,MAGMiddleAuthor,2006-2010,journal,misconduct,3.0,3-5,non-STEM fields,101-500,1,4


In [654]:
df_control_augmented = df_control.\
                merge(df_1d_collaborators_control_w5[['MAGAID','MatchMAGAID','NumRetentionW5',
                                                        'NumNewCollaboratorsW5']].drop_duplicates(),
                     on=['MAGAID','MatchMAGAID'])

df_control_augmented

Unnamed: 0,MAGAID,MatchMAGAID,Record ID,MAGPID,RetractionYear,MAGRetractionYearAffID,MAGRetractionYearAffYear,MAGRetractionYearAffRank,MatchMAGRetractionYearAffID,MatchMAGRetractionYearAffYear,...,MAGAIDFirstORLastAuthorFlag,DemiDecadeOfRetraction,RetractorMajority,ReasonPropagatedMajorityOfMajority,AcademicAgeBeforeRetraction,AcademicAgeBin,MAGFieldName,MAGRetractionYearAffRankStratified,NumRetentionW5,NumNewCollaboratorsW5
0,8.197726e+06,2.107234e+09,3444,1.506358e+08,2012,1.395588e+07,2012.0,401-500,6.973702e+07,2012.0,...,MAGMiddleAuthor,2011-2015,author,mistake,5.0,3-5,chemistry,101-500,2,9
1,2.127731e+07,7.210135e+08,21911,2.949847e+09,2011,1.294672e+09,2011.0,1001-,1.294672e+09,2009.0,...,MAGFirstOrLastAuthor,2011-2015,author,mistake,16.0,>5,chemistry,>1000,11,39
2,3.343381e+07,2.135651e+09,23728,2.055923e+09,2009,1.733049e+08,2009.0,201-300,1.217483e+08,2009.0,...,MAGFirstOrLastAuthor,2006-2010,author,mistake,2.0,2,chemistry,101-500,6,14
3,6.373615e+07,2.486812e+09,4317,2.166194e+09,2009,1.733049e+08,2009.0,201-300,1.687197e+08,2009.0,...,MAGMiddleAuthor,2006-2010,other retractor,mistake,2.0,2,chemistry,101-500,2,2
4,7.092528e+07,2.949159e+09,8100,2.068127e+09,2014,1.582483e+08,2014.0,501-600,2.007630e+08,2014.0,...,MAGFirstOrLastAuthor,2011-2015,other retractor,plagiarism,4.0,3-5,chemistry,501-1000,10,62
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5840,2.714016e+09,3.168352e+09,7423,2.147214e+09,2008,1.297744e+08,2008.0,151-200,2.049832e+08,2008.0,...,MAGMiddleAuthor,2006-2010,other retractor,plagiarism,0.0,1,non-STEM fields,101-500,2,3
5841,2.947074e+09,3.019838e+08,16554,2.014362e+09,2014,3.202198e+07,2014.0,201-300,1.386896e+08,2014.0,...,MAGFirstOrLastAuthor,2011-2015,other retractor,misconduct,21.0,>5,non-STEM fields,101-500,41,141
5842,2.501052e+09,2.149938e+09,18635,2.081956e+09,2012,1.912085e+08,2012.0,201-300,1.628687e+08,2011.0,...,MAGMiddleAuthor,2011-2015,other retractor,other,1.0,1,non-STEM fields,101-500,5,5
5843,2.993453e+09,2.169270e+09,4714,2.192461e+09,2012,2.394603e+07,2011.0,301-400,2.986251e+08,2011.0,...,MAGFirstOrLastAuthor,2011-2015,journal,plagiarism,1.0,1,non-STEM fields,101-500,2,8


In [655]:
# df_1d_collaborators_control_w5[df_1d_collaborators_control_w5.MatchMAGAID.isin(temp)].NumNewCollaboratorsW5.value_counts()

### Triadic Closure

In [29]:
# Let us first read the triadic closure files

triadic_closure_path = '/Users/sm9654/desktop/NYUAD/nyuad-research/retraction_effects_on_collaboration_networks/data/main/triadicClosure/rematching/nonattrited_matches/'
import os

flist = os.listdir(triadic_closure_path)

dfs = []
for f in flist:
    df = pd.read_csv(triadic_closure_path+f, usecols=['MAGAID','NumOpenTriads','RetractionYear','NumTriadsClosed','NC'])
    dfs.append(df)
    
df_triads = pd.concat(dfs)

df_triads.head()


Unnamed: 0,MAGAID,CollabAIDGainedW5,pre,RetractionYear,pre_2D,pre_2D_wo_1D,TriadsClosed,NumOpenTriads,NumTriadsClosed,NC
0,2642974000.0,{2947813127},"{2302816209, 380396913, 2704301283, 2171151900}",2014,"{2306037252, 2558065160, 2302771209, 182908468...","{2306037252, 2558065160, 2302771209, 182908468...",set(),126,0,0.0
1,2643290000.0,"{2052262048, 1212737670, 2585816806, 213564372...","{2083140866, 2714291460, 2660419465, 176157044...",2015,"{2083140866, 2128399107, 2714291460, 180161446...","{2128399107, 2719671305, 2148987658, 217486746...",set(),66,0,0.0
2,2643319000.0,"{2510228103, 1255027345, 2145724305, 217042626...","{2892668162, 2115960963, 2130692996, 214375718...",2013,"{2335897603, 2893752323, 2215921669, 279863040...","{2893752323, 2335897603, 2798630404, 221592166...",set(),262,0,0.0
3,2643339000.0,set(),"{2435970729, 2555864395, 1999676908, 244266199...",1996,"{2146113537, 2032271362, 2397126660, 215410689...","{2146113537, 2032271362, 2637216770, 239712666...",set(),462,0,0.0
4,2643419000.0,set(),"{2694046386, 2487858741}",2015,"{2643418848, 2694046386}",{2643418848},set(),1,0,0.0


0       {2302816209, 380396913, 2704301283, 2171151900}
1     {2083140866, 2714291460, 2660419465, 176157044...
2     {2892668162, 2115960963, 2130692996, 214375718...
3     {2435970729, 2555864395, 1999676908, 244266199...
4                              {2694046386, 2487858741}
                            ...                        
73    {2559878403, 2689813638, 2224876938, 230560935...
74    {2303152128, 3053834756, 3176056334, 246618574...
75    {2388327812, 2204176390, 2633631111, 305493902...
76    {2129257346, 261591564, 2171076502, 2169281304...
77    {2127752524, 2443649074, 2580435315, 258202738...
Name: pre, Length: 7796, dtype: object

In [657]:
df_triads.MAGAID.nunique() # because it contains matches too

7783

In [658]:
df_treatment_augmented = df_treatment_augmented.merge(df_triads, on='MAGAID', how='left')
print(df_treatment_augmented[df_treatment_augmented.NC.isna()].MAGAID.nunique())
df_treatment_augmented['NC'] = df_treatment_augmented['NC'].fillna(0)


df_control_augmented = df_control_augmented.merge(df_triads.rename(columns={'MAGAID':'MatchMAGAID'}), 
                                                                    on=['MatchMAGAID','RetractionYear'], how='left')
                                                  
print(df_control_augmented[df_control_augmented.NC.isna()].MatchMAGAID.nunique())

df_control_augmented['NC'] = df_control_augmented['NC'].fillna(0)

1
4


In [659]:
df_control_augmented

Unnamed: 0,MAGAID,MatchMAGAID,Record ID,MAGPID,RetractionYear,MAGRetractionYearAffID,MAGRetractionYearAffYear,MAGRetractionYearAffRank,MatchMAGRetractionYearAffID,MatchMAGRetractionYearAffYear,...,ReasonPropagatedMajorityOfMajority,AcademicAgeBeforeRetraction,AcademicAgeBin,MAGFieldName,MAGRetractionYearAffRankStratified,NumRetentionW5,NumNewCollaboratorsW5,NumOpenTriads,NumTriadsClosed,NC
0,8.197726e+06,2.107234e+09,3444,1.506358e+08,2012,1.395588e+07,2012.0,401-500,6.973702e+07,2012.0,...,mistake,5.0,3-5,chemistry,101-500,2,9,43,0,0.000000
1,2.127731e+07,7.210135e+08,21911,2.949847e+09,2011,1.294672e+09,2011.0,1001-,1.294672e+09,2009.0,...,mistake,16.0,>5,chemistry,>1000,11,39,109,0,0.000000
2,3.343381e+07,2.135651e+09,23728,2.055923e+09,2009,1.733049e+08,2009.0,201-300,1.217483e+08,2009.0,...,mistake,2.0,2,chemistry,101-500,6,14,53,0,0.000000
3,6.373615e+07,2.486812e+09,4317,2.166194e+09,2009,1.733049e+08,2009.0,201-300,1.687197e+08,2009.0,...,mistake,2.0,2,chemistry,101-500,2,2,60,0,0.000000
4,7.092528e+07,2.949159e+09,8100,2.068127e+09,2014,1.582483e+08,2014.0,501-600,2.007630e+08,2014.0,...,plagiarism,4.0,3-5,chemistry,501-1000,10,62,862,3,0.003480
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5840,2.714016e+09,3.168352e+09,7423,2.147214e+09,2008,1.297744e+08,2008.0,151-200,2.049832e+08,2008.0,...,plagiarism,0.0,1,non-STEM fields,101-500,2,3,7,2,0.285714
5841,2.947074e+09,3.019838e+08,16554,2.014362e+09,2014,3.202198e+07,2014.0,201-300,1.386896e+08,2014.0,...,misconduct,21.0,>5,non-STEM fields,101-500,41,141,202,0,0.000000
5842,2.501052e+09,2.149938e+09,18635,2.081956e+09,2012,1.912085e+08,2012.0,201-300,1.628687e+08,2011.0,...,other,1.0,1,non-STEM fields,101-500,5,5,30,3,0.100000
5843,2.993453e+09,2.169270e+09,4714,2.192461e+09,2012,2.394603e+07,2011.0,301-400,2.986251e+08,2011.0,...,plagiarism,1.0,1,non-STEM fields,101-500,2,8,50,0,0.000000


In [660]:
df_treatment_augmented.to_csv("../../data/main/RWMAG_rematched_treatment_augmented_rematching_30perc_290523.csv", index=False)
df_control_augmented.to_csv("../../data/main/RWMAG_rematched_control_augmented_rematching_30perc_290523.csv", index=False)



In [452]:
df_control_augmented.MatchMAGAID.nunique()

4274

In [453]:
df_control_augmented.MAGAID.nunique()

3094

In [454]:
df_treatment_augmented.MAGAID.nunique()

3094

In [459]:
df_treatment_augmented

Unnamed: 0,MAGAID,Record ID,MAGPID,RetractionYear_x,MAGRetractionYearAffID,MAGRetractionYearAffYear,MAGRetractionYearAffRank,MAGrootFID,MAGrootFIDMaxPercent,GenderizeGender,...,MAGRetractionYearAffRankStratified,NumRetentionW5,NumNewCollaboratorsW5,NumOpenTriads_x,NumTriadsClosed_x,NC_x,RetractionYear_y,NumOpenTriads_y,NumTriadsClosed_y,NC_y
0,8.197726e+06,3444,1.506358e+08,2012,1.395588e+07,2012.0,401-500,185592680.0,0.333333,female,...,101-500,4,10,4.0,1.0,0.250000,2012.0,4.0,1.0,0.250000
1,2.127731e+07,21911,2.949847e+09,2011,1.294672e+09,2011.0,1001-,185592680.0,0.388889,male,...,>1000,6,14,26.0,0.0,0.000000,2011.0,26.0,0.0,0.000000
2,3.343381e+07,23728,2.055923e+09,2009,1.733049e+08,2009.0,201-300,185592680.0,0.600000,male,...,101-500,4,0,5.0,0.0,0.000000,2009.0,5.0,0.0,0.000000
3,6.373615e+07,4317,2.166194e+09,2009,1.733049e+08,2009.0,201-300,185592680.0,0.194444,female,...,101-500,7,29,29.0,0.0,0.000000,2009.0,29.0,0.0,0.000000
4,7.092528e+07,8100,2.068127e+09,2014,1.582483e+08,2014.0,501-600,185592680.0,0.196970,male,...,501-1000,4,12,130.0,0.0,0.000000,2014.0,130.0,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3677,2.159512e+09,1987,2.059315e+09,2015,9.014989e+07,2015.0,1001-,127313418.0,0.163265,male,...,>1000,9,12,79.0,0.0,0.000000,2015.0,79.0,0.0,0.000000
3678,2.160882e+09,16554,2.014362e+09,2014,9.360294e+06,2014.0,201-300,127313418.0,0.333333,male,...,101-500,8,12,3.0,0.0,0.000000,2014.0,3.0,0.0,0.000000
3679,2.162518e+09,8543,2.121510e+09,2013,6.690620e+07,2012.0,901-1000,127313418.0,0.230769,female,...,501-1000,5,13,1.0,0.0,0.000000,2013.0,1.0,0.0,0.000000
3680,2.195114e+09,18557,2.045810e+09,2008,2.986251e+08,2008.0,301-400,127313418.0,0.285714,male,...,101-500,1,4,20.0,1.0,0.050000,2008.0,20.0,1.0,0.050000


In [501]:
df_control_augmented[['MAGAID','MatchMAGAID']].drop_duplicates().\
    groupby('MatchMAGAID')['MAGAID'].nunique().gt(1).reset_index().MatchMAGAID.astype(str)

0           345519.0
1          1936235.0
2          4830308.0
3          8411596.0
4         15464380.0
            ...     
4269    3172872197.0
4270    3173994760.0
4271    3174239158.0
4272    3174995368.0
4273    3176661151.0
Name: MatchMAGAID, Length: 4274, dtype: object

In [503]:
df_control_augmented[df_control_augmented.MatchMAGAID.eq(1936235)]['RetractionYear']

1951    2006
1960    2006
1975    2012
Name: RetractionYear, dtype: int64

In [302]:
df_treatment_augmented.MAGAIDFirstORLastAuthorFlag

0            MAGMiddleAuthor
1            MAGMiddleAuthor
2            MAGMiddleAuthor
3            MAGMiddleAuthor
4       MAGFirstOrLastAuthor
                ...         
4976    MAGFirstOrLastAuthor
4977    MAGFirstOrLastAuthor
4978    MAGFirstOrLastAuthor
4979    MAGFirstOrLastAuthor
4980    MAGFirstOrLastAuthor
Name: MAGAIDFirstORLastAuthorFlag, Length: 4981, dtype: object