# Characterizing Collaborators

In this notebook, we characterize the collaborators of retracted and matched scientists.

Retractions are grouped by three reasons: misconduct, plagiarism, and mistake.

For each academic age group within retracted and matched scientists **at the time of retraction**, we shall conduct three analyses and create three tables:

#### Retained for retracted vs. matched
1. Table 1 comparing the **retained** collaborators of retracted and matched scientists in terms of their (a) mean academic age, (b) average number of papers, (c) average number of citations, (d) average number of collaborators, all **at the time of collaboration**. The table will also contain median, standard deviation, and p-value for t-test.

#### Gained for retracted vs. matched
2. Table 2 comparing the **gained/new** collaborators of retracted and matched scientists in terms of their (a) mean academic age, (b) average number of papers, (c) average number of citations, (d) average number of collaborators, all **at the time of collaboration**. The table will also contain median, standard deviation, and p-value for t-test.

#### Retained vs. lost for retracted vs. matched
3. Table 3 comparing the **retained** collaborators of retracted and matched scientists to those **lost** in terms of their (a) mean academic age, (b) average number of papers, (c) average number of citations, (d) average number of collaborators, all **at the time of retraction**. The table will be produced with a difference-in-differences (DiD) approach. First, we compute the average of each field (papers, citations, etc.) for the retained and the lost collaborators of retracted and of matched scientists. Next, we compute the difference between retracted and matched for the retained collaborators, and the same difference for the lost collaborators. Finally, we take the difference of these two differences, i.e. **RETAINED-LOST**. The table will also contain median, standard deviation, and p-value for t-test.
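
As a rough illustration of the difference-in-differences arithmetic planned for Table 3, the sketch below uses made-up group means, not results from this dataset. The function name `did_estimate` is hypothetical, and the sign convention (retracted minus matched) is an assumption chosen for the example.

```python
# Hypothetical group means for one field (e.g. papers); illustrative values only.
means = {
    ("retracted", "retained"): 40.0,
    ("matched", "retained"): 35.0,
    ("retracted", "lost"): 30.0,
    ("matched", "lost"): 28.0,
}

def did_estimate(means):
    """Return (retained gap, lost gap, DiD); each gap is retracted minus matched."""
    gap_retained = means[("retracted", "retained")] - means[("matched", "retained")]
    gap_lost = means[("retracted", "lost")] - means[("matched", "lost")]
    # DiD is the retained gap minus the lost gap (RETAINED-LOST).
    return gap_retained, gap_lost, gap_retained - gap_lost

gap_retained, gap_lost, did = did_estimate(means)
print(gap_retained, gap_lost, did)
```

A positive value would mean the retracted-versus-matched gap is larger among retained collaborators than among lost ones.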



In [1]:
import pandas as pd
import sys
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats



In [2]:
INDIR = "/Users/sm9654/desktop/NYUAD/nyuad-research/retraction_openalex/retraction_effects_on_academic_careers/data/processed/"
INDIR_MATCHING = INDIR+"/author_matching/"
INDIR_COLLAB = INDIR+"/collaborator_quality_analysis/"

df = pd.read_csv(INDIR_COLLAB+"/1Dcollaborators_for_matched_sample_30.csv")

print(df.shape)

df.head()

(773033, 20)


Unnamed: 0,MAGAID,MAGCollabAID,RetractionYear,MAGCollaborationYear,ScientistType,CollabGenderizeGender,CollabGenderizeConfidence,CollabMAGFirstPubYear,CollabMAGCumPapersYearAtRetraction,CollabMAGCumPapersAtRetraction,CollabMAGCumCitationsYearAtRetraction,CollabMAGCumCitationsAtRetraction,CollabMAGCumCollaboratorsYearAtRetraction,CollabMAGCumCollaboratorsAtRetraction,CollabMAGCumPapersYearAtCollaboration,CollabMAGCumPapersAtCollaboration,CollabMAGCumCitationsYearAtCollaboration,CollabMAGCumCitationsAtCollaboration,CollabMAGCumCollaboratorsYearAtCollaboration,CollabMAGCumCollaboratorsAtCollaboration
0,2105038000.0,2004120834,1994.0,1983.0,retracted,male,0.99,1967.0,1994.0,104.0,1994.0,3683.0,1994.0,83.0,1983.0,47.0,1983.0,1076.0,1983.0,46
1,2105038000.0,2124401064,1994.0,1983.0,retracted,male,0.74,1964.0,1994.0,78.0,1994.0,2668.0,1994.0,81.0,1983.0,47.0,1983.0,1305.0,1983.0,37
2,2105038000.0,2486043001,1994.0,1983.0,retracted,male,0.6,1971.0,1983.0,10.0,1994.0,532.0,1983.0,18.0,1983.0,10.0,1983.0,199.0,1983.0,18
3,2105038000.0,2124401064,1994.0,1992.0,retracted,male,0.74,1964.0,1994.0,78.0,1994.0,2668.0,1994.0,81.0,1992.0,74.0,1992.0,2449.0,1992.0,71
4,2105038000.0,2276877851,1994.0,1992.0,retracted,female,0.98,1984.0,1993.0,16.0,1994.0,136.0,1993.0,31.0,1992.0,14.0,1992.0,90.0,1992.0,27


In [3]:
print(df.shape)

(773033, 20)


In [4]:
df.MAGCollabAID.nunique()

411911

### Preprocessing

In [5]:
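# Let us first load the scientist-level columns used for matching (retraction reason, collaborator counts).
# Note: 'AcademicAgeBeforeRetraction' is not in usecols below, so the rename on df_treatment is a no-op
# and no academic-age column is added at this step.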

# Reading files used for matching

df_treatment = pd.read_csv(INDIR_MATCHING+"/RWMAG_rematched_treatment_augmented_rematching_30perc.csv",
                    usecols=['MAGAID','RetractionYear','NumRetentionW5','NumNewCollaboratorsW5','ReasonPropagatedMajorityOfMajority'])\
                    .drop_duplicates()\
                    .rename(columns={
                                    'AcademicAgeBeforeRetraction': 'AcademicAgeAtRetraction'})

df_control = pd.read_csv(INDIR_MATCHING+"/RWMAG_rematched_control_augmented_rematching_30perc.csv",
                    usecols=['MatchMAGAID','RetractionYear','NumRetentionW5','NumNewCollaboratorsW5','ReasonPropagatedMajorityOfMajority'])\
                    .drop_duplicates()\
                    .rename(columns={'MatchMAGAID':'MAGAID'})

df_treatment_control = pd.concat([df_treatment,df_control])

# Removing reasons that are not misconduct, plagiarism, mistake

df_treatment_control = df_treatment_control[df_treatment_control.\
                                        ReasonPropagatedMajorityOfMajority.isin(['misconduct',
                                                                                'plagiarism',
                                                                                'mistake'])]

df_treatment_control

Unnamed: 0,MAGAID,RetractionYear,ReasonPropagatedMajorityOfMajority,NumRetentionW5,NumNewCollaboratorsW5
1,8.197726e+06,2012.0,mistake,4,10
3,9.474215e+06,2015.0,mistake,55,481
4,1.373700e+07,2014.0,mistake,8,8
5,1.551904e+07,2013.0,plagiarism,0,4
6,4.757012e+07,2015.0,misconduct,5,19
...,...,...,...,...,...
5411,2.127710e+09,2015.0,plagiarism,3,25
5412,1.974243e+09,2013.0,mistake,3,2
5414,1.933448e+09,2014.0,misconduct,10,36
5416,2.077873e+09,2008.0,misconduct,2,3


In [6]:
# Merging that with df

df2 = df.merge(df_treatment_control.drop(columns=['NumRetentionW5','NumNewCollaboratorsW5']), 
                                         on=['MAGAID','RetractionYear'])
df2

Unnamed: 0,MAGAID,MAGCollabAID,RetractionYear,MAGCollaborationYear,ScientistType,CollabGenderizeGender,CollabGenderizeConfidence,CollabMAGFirstPubYear,CollabMAGCumPapersYearAtRetraction,CollabMAGCumPapersAtRetraction,...,CollabMAGCumCitationsAtRetraction,CollabMAGCumCollaboratorsYearAtRetraction,CollabMAGCumCollaboratorsAtRetraction,CollabMAGCumPapersYearAtCollaboration,CollabMAGCumPapersAtCollaboration,CollabMAGCumCitationsYearAtCollaboration,CollabMAGCumCitationsAtCollaboration,CollabMAGCumCollaboratorsYearAtCollaboration,CollabMAGCumCollaboratorsAtCollaboration,ReasonPropagatedMajorityOfMajority
0,2.105038e+09,2004120834,1994.0,1983.0,retracted,male,0.99,1967.0,1994.0,104.0,...,3683.0,1994.0,83.0,1983.0,47.0,1983.0,1076.0,1983.0,46,mistake
1,2.105038e+09,2124401064,1994.0,1983.0,retracted,male,0.74,1964.0,1994.0,78.0,...,2668.0,1994.0,81.0,1983.0,47.0,1983.0,1305.0,1983.0,37,mistake
2,2.105038e+09,2486043001,1994.0,1983.0,retracted,male,0.60,1971.0,1983.0,10.0,...,532.0,1983.0,18.0,1983.0,10.0,1983.0,199.0,1983.0,18,mistake
3,2.105038e+09,2124401064,1994.0,1992.0,retracted,male,0.74,1964.0,1994.0,78.0,...,2668.0,1994.0,81.0,1992.0,74.0,1992.0,2449.0,1992.0,71,mistake
4,2.105038e+09,2276877851,1994.0,1992.0,retracted,female,0.98,1984.0,1993.0,16.0,...,136.0,1993.0,31.0,1992.0,14.0,1992.0,90.0,1992.0,27,mistake
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
626059,2.294600e+09,2972696792,2012.0,2020.0,matched,male,0.99,2019.0,,0.0,...,0.0,,0.0,2020.0,3.0,2020.0,1.0,2020.0,17,plagiarism
626060,2.294600e+09,3111842016,2012.0,2020.0,matched,female,0.98,2020.0,,0.0,...,0.0,,0.0,2020.0,1.0,,0.0,2020.0,7,plagiarism
626061,2.294600e+09,3112134165,2012.0,2020.0,matched,female,0.99,2020.0,,0.0,...,0.0,,0.0,2020.0,1.0,,0.0,2020.0,7,plagiarism
626062,2.294600e+09,3112412017,2012.0,2020.0,matched,female,0.97,2020.0,,0.0,...,0.0,,0.0,2020.0,1.0,,0.0,2020.0,7,plagiarism


In [7]:
# Let us first compute academic age at retraction and at collaboration for collaborators
df2['CollabAcademicAgeAtRetraction'] = df2['RetractionYear']-df2['CollabMAGFirstPubYear']

df2['CollabAcademicAgeAtCollaboration'] = df2['MAGCollaborationYear']-df2['CollabMAGFirstPubYear']

# So negatives are possible in academic age at retraction but not collaboration
df2.CollabAcademicAgeAtRetraction.describe()

count    626064.000000
mean          9.744611
std          13.963814
min         -30.000000
25%           0.000000
50%           7.000000
75%          17.000000
max         215.000000
Name: CollabAcademicAgeAtRetraction, dtype: float64

In [8]:
# Let us first identify if the collaboration was pre- or post-retraction

def get_prepost_flag(row):
    if(row['MAGCollaborationYear'] <= row['RetractionYear']):
        return 'pre'
    else:
        if((row['MAGCollaborationYear']-row['RetractionYear'])<=5):
            return 'post5'
        else:
            return 'post'

df2['PrePostFlag5'] = df2.apply(get_prepost_flag, axis=1)
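
The row-wise `apply` above can be slow on several hundred thousand rows. A vectorized equivalent might look like the sketch below, shown on a toy frame with illustrative values. The name `toy` and the use of `np.select` are assumptions for the example, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy rows; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "RetractionYear": [2010, 2010, 2010, 2010],
    "MAGCollaborationYear": [2008, 2010, 2015, 2016],
})

gap = toy["MAGCollaborationYear"] - toy["RetractionYear"]

# np.select checks the conditions in order and uses the first one that holds,
# mirroring the if/else chain in get_prepost_flag.
toy["PrePostFlag5"] = np.select(
    [gap <= 0, gap <= 5],
    ["pre", "post5"],
    default="post",
)
print(toy)
```

Collaborations in the retraction year itself count as `pre`, and the five-year window is inclusive of year five, as in the function above.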

In [9]:
# Let us remove the collaborators that are "post"

df3 = df2[~df2.PrePostFlag5.eq('post')]


In [10]:
df3.head()

Unnamed: 0,MAGAID,MAGCollabAID,RetractionYear,MAGCollaborationYear,ScientistType,CollabGenderizeGender,CollabGenderizeConfidence,CollabMAGFirstPubYear,CollabMAGCumPapersYearAtRetraction,CollabMAGCumPapersAtRetraction,...,CollabMAGCumPapersYearAtCollaboration,CollabMAGCumPapersAtCollaboration,CollabMAGCumCitationsYearAtCollaboration,CollabMAGCumCitationsAtCollaboration,CollabMAGCumCollaboratorsYearAtCollaboration,CollabMAGCumCollaboratorsAtCollaboration,ReasonPropagatedMajorityOfMajority,CollabAcademicAgeAtRetraction,CollabAcademicAgeAtCollaboration,PrePostFlag5
0,2105038000.0,2004120834,1994.0,1983.0,retracted,male,0.99,1967.0,1994.0,104.0,...,1983.0,47.0,1983.0,1076.0,1983.0,46,mistake,27.0,16.0,pre
1,2105038000.0,2124401064,1994.0,1983.0,retracted,male,0.74,1964.0,1994.0,78.0,...,1983.0,47.0,1983.0,1305.0,1983.0,37,mistake,30.0,19.0,pre
2,2105038000.0,2486043001,1994.0,1983.0,retracted,male,0.6,1971.0,1983.0,10.0,...,1983.0,10.0,1983.0,199.0,1983.0,18,mistake,23.0,12.0,pre
3,2105038000.0,2124401064,1994.0,1992.0,retracted,male,0.74,1964.0,1994.0,78.0,...,1992.0,74.0,1992.0,2449.0,1992.0,71,mistake,30.0,28.0,pre
4,2105038000.0,2276877851,1994.0,1992.0,retracted,female,0.98,1984.0,1993.0,16.0,...,1992.0,14.0,1992.0,90.0,1992.0,27,mistake,10.0,8.0,pre


In [11]:
df3.columns

Index(['MAGAID', 'MAGCollabAID', 'RetractionYear', 'MAGCollaborationYear',
       'ScientistType', 'CollabGenderizeGender', 'CollabGenderizeConfidence',
       'CollabMAGFirstPubYear', 'CollabMAGCumPapersYearAtRetraction',
       'CollabMAGCumPapersAtRetraction',
       'CollabMAGCumCitationsYearAtRetraction',
       'CollabMAGCumCitationsAtRetraction',
       'CollabMAGCumCollaboratorsYearAtRetraction',
       'CollabMAGCumCollaboratorsAtRetraction',
       'CollabMAGCumPapersYearAtCollaboration',
       'CollabMAGCumPapersAtCollaboration',
       'CollabMAGCumCitationsYearAtCollaboration',
       'CollabMAGCumCitationsAtCollaboration',
       'CollabMAGCumCollaboratorsYearAtCollaboration',
       'CollabMAGCumCollaboratorsAtCollaboration',
       'ReasonPropagatedMajorityOfMajority', 'CollabAcademicAgeAtRetraction',
       'CollabAcademicAgeAtCollaboration', 'PrePostFlag5'],
      dtype='object')

In [12]:
# For each MAGAID, let us create a set of collaborators pre- and post- retraction

df4 = df3.groupby(['MAGAID','RetractionYear','PrePostFlag5'])\
                        ['MAGCollabAID'].apply(set).unstack().reset_index()


# Converting pre- and post5 columns to set so we can do set operations
df4['pre'] = df4['pre'].apply(lambda d: d if isinstance(d, set) else set())
df4['post5'] = df4['post5'].apply(lambda d: d if isinstance(d, set) else set())


# COLLABORATOR RETENTION

# Computing number of collaborators retained
df4['NumRetentionW5'] = df4.apply(lambda row: len(row.post5.intersection(row.pre)), 
                            axis=1)

# Creating the list of collaborators retained
df4['CollabAIDRetainedW5'] = df4.apply(lambda row: row.post5.intersection(row.pre), 
                                                    axis=1)


# Creating list of collaborators lost
df4['CollabAIDLostW5'] = df4.apply(lambda row: row['pre'] - row['CollabAIDRetainedW5'], 
                                                    axis=1)


# COLLABORATOR GAIN

# Computing number of collaborators gained
df4['NumNewCollaboratorsW5'] = df4.apply(lambda row: len(row['post5']-row['pre']), 
                                                    axis=1)

# Creating set of collaborators gained
df4['CollabAIDGainedW5'] = df4.apply(lambda row: row['post5']-row['pre'], 
                                                    axis=1)


df4.head()

PrePostFlag5,MAGAID,RetractionYear,post5,pre,NumRetentionW5,CollabAIDRetainedW5,CollabAIDLostW5,NumNewCollaboratorsW5,CollabAIDGainedW5
0,5440459.0,2010.0,"{2084557827, 2402421397, 2235672854, 213407784...","{2132186624, 2084557827, 3130870283, 222369435...",13,"{2680186915, 2084557827, 2144828135, 97975655,...","{2132186624, 3130870283, 2223694350, 215816449...",35,"{2134077845, 2402421397, 2235672854, 217199132..."
1,8197726.0,2012.0,"{1969204096, 2226225926, 2024673639, 199363863...","{2024673639, 279977545, 1970425162, 1689175372...",4,"{1970425162, 1689175372, 2122225613, 2024673639}","{279977545, 1778388943, 2506500849, 343947537,...",10,"{1969204096, 2226225926, 1993638631, 256742932..."
2,8227037.0,2003.0,"{2438245003, 2566236814, 2099366676, 186420623...","{2119475842, 2103663625, 3037713418, 276189799...",6,"{2698681252, 2464725413, 2135649512, 251608811...","{2119475842, 2103663625, 3037713418, 276189799...",20,"{2438245003, 1513362301, 2566236814, 209936667..."
3,9474215.0,2015.0,"{2741493764, 1134876678, 2589376522, 288503604...","{2102010368, 2147699201, 2402001923, 277873818...",55,"{2102010368, 2147699201, 2402001923, 211700557...","{2613431302, 1578585096, 2047308816, 216607540...",481,"{2741493764, 2589376522, 2885036046, 297136949..."
4,13737004.0,2014.0,"{2144558976, 2068504066, 1986642243, 146813190...","{2144558976, 1456934145, 2050002051, 146813190...",8,"{2144558976, 1986642243, 146813190, 2304798023...","{1456934145, 2050002051, 2672525068, 279086184...",8,"{2068504066, 1526526887, 1529129866, 207583934..."
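
To make the set logic concrete, here is a minimal sketch with hypothetical collaborator IDs for a single scientist. The IDs are invented for illustration.

```python
# Hypothetical collaborator IDs for a single scientist.
pre = {1, 2, 3, 4}       # collaborators up to and including the retraction year
post5 = {3, 4, 5}        # collaborators in the five years after retraction

retained = post5 & pre   # present in both windows
lost = pre - retained    # present before, absent after
gained = post5 - pre     # present after, absent before

print(retained, lost, gained)
```

The three sets are pairwise disjoint and together cover every collaborator in `pre` or `post5`, which is why each collaborator should end up with exactly one of the retained, lost, or gained labels.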


### Validation of the number of collaborators retained and gained 

We now check that the counts computed above match the ones on which matching was done.

In [13]:
# Merging
dfvalidation = df4[['MAGAID','RetractionYear','NumRetentionW5','NumNewCollaboratorsW5']].drop_duplicates().\
                    merge(df_treatment_control, on=['MAGAID','RetractionYear'])

dfvalidation

Unnamed: 0,MAGAID,RetractionYear,NumRetentionW5_x,NumNewCollaboratorsW5_x,ReasonPropagatedMajorityOfMajority,NumRetentionW5_y,NumNewCollaboratorsW5_y
0,5.440459e+06,2010.0,13,35,plagiarism,13,35
1,8.197726e+06,2012.0,4,10,mistake,4,10
2,8.227037e+06,2003.0,6,20,mistake,6,20
3,9.474215e+06,2015.0,55,481,mistake,55,481
4,1.373700e+07,2014.0,8,8,mistake,8,8
...,...,...,...,...,...,...,...
3838,3.174124e+09,2004.0,1,6,mistake,1,6
3839,3.174844e+09,2014.0,3,3,misconduct,3,3
3840,3.175436e+09,2015.0,1,14,plagiarism,1,14
3841,3.176126e+09,2004.0,4,1,mistake,4,1


In [14]:
# Finally validating

dfvalidation[(dfvalidation.NumRetentionW5_x == dfvalidation.NumRetentionW5_y) & 
            (dfvalidation.NumNewCollaboratorsW5_x == dfvalidation.NumNewCollaboratorsW5_y)]

Unnamed: 0,MAGAID,RetractionYear,NumRetentionW5_x,NumNewCollaboratorsW5_x,ReasonPropagatedMajorityOfMajority,NumRetentionW5_y,NumNewCollaboratorsW5_y
0,5.440459e+06,2010.0,13,35,plagiarism,13,35
1,8.197726e+06,2012.0,4,10,mistake,4,10
2,8.227037e+06,2003.0,6,20,mistake,6,20
3,9.474215e+06,2015.0,55,481,mistake,55,481
4,1.373700e+07,2014.0,8,8,mistake,8,8
...,...,...,...,...,...,...,...
3838,3.174124e+09,2004.0,1,6,mistake,1,6
3839,3.174844e+09,2014.0,3,3,misconduct,3,3
3840,3.175436e+09,2015.0,1,14,plagiarism,1,14
3841,3.176126e+09,2004.0,4,1,mistake,4,1


**Hence all of them are validated.**
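
The check above relies on comparing the two displayed tables by eye. A more explicit variant could count the disagreeing rows directly, as in this sketch. `toy_validation` is a hypothetical stand-in for `dfvalidation` with illustrative values.

```python
import pandas as pd

# Toy stand-in for dfvalidation; the values are illustrative.
toy_validation = pd.DataFrame({
    "NumRetentionW5_x": [13, 4, 6],
    "NumRetentionW5_y": [13, 4, 6],
    "NumNewCollaboratorsW5_x": [35, 10, 20],
    "NumNewCollaboratorsW5_y": [35, 10, 20],
})

matches = (
    toy_validation["NumRetentionW5_x"].eq(toy_validation["NumRetentionW5_y"])
    & toy_validation["NumNewCollaboratorsW5_x"].eq(toy_validation["NumNewCollaboratorsW5_y"])
)

# Count disagreements instead of comparing table lengths visually.
n_mismatch = int((~matches).sum())
print(n_mismatch)
assert n_mismatch == 0
```

An assertion of this kind fails loudly if a later change to the preprocessing breaks the agreement.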

## Analysis

In [15]:
# Our main dataframes are df3 and df4
# Let us look at them first
print(df3.shape)
df3.head()

(464191, 24)


Unnamed: 0,MAGAID,MAGCollabAID,RetractionYear,MAGCollaborationYear,ScientistType,CollabGenderizeGender,CollabGenderizeConfidence,CollabMAGFirstPubYear,CollabMAGCumPapersYearAtRetraction,CollabMAGCumPapersAtRetraction,...,CollabMAGCumPapersYearAtCollaboration,CollabMAGCumPapersAtCollaboration,CollabMAGCumCitationsYearAtCollaboration,CollabMAGCumCitationsAtCollaboration,CollabMAGCumCollaboratorsYearAtCollaboration,CollabMAGCumCollaboratorsAtCollaboration,ReasonPropagatedMajorityOfMajority,CollabAcademicAgeAtRetraction,CollabAcademicAgeAtCollaboration,PrePostFlag5
0,2105038000.0,2004120834,1994.0,1983.0,retracted,male,0.99,1967.0,1994.0,104.0,...,1983.0,47.0,1983.0,1076.0,1983.0,46,mistake,27.0,16.0,pre
1,2105038000.0,2124401064,1994.0,1983.0,retracted,male,0.74,1964.0,1994.0,78.0,...,1983.0,47.0,1983.0,1305.0,1983.0,37,mistake,30.0,19.0,pre
2,2105038000.0,2486043001,1994.0,1983.0,retracted,male,0.6,1971.0,1983.0,10.0,...,1983.0,10.0,1983.0,199.0,1983.0,18,mistake,23.0,12.0,pre
3,2105038000.0,2124401064,1994.0,1992.0,retracted,male,0.74,1964.0,1994.0,78.0,...,1992.0,74.0,1992.0,2449.0,1992.0,71,mistake,30.0,28.0,pre
4,2105038000.0,2276877851,1994.0,1992.0,retracted,female,0.98,1984.0,1993.0,16.0,...,1992.0,14.0,1992.0,90.0,1992.0,27,mistake,10.0,8.0,pre


In [16]:
df4

PrePostFlag5,MAGAID,RetractionYear,post5,pre,NumRetentionW5,CollabAIDRetainedW5,CollabAIDLostW5,NumNewCollaboratorsW5,CollabAIDGainedW5
0,5.440459e+06,2010.0,"{2084557827, 2402421397, 2235672854, 213407784...","{2132186624, 2084557827, 3130870283, 222369435...",13,"{2680186915, 2084557827, 2144828135, 97975655,...","{2132186624, 3130870283, 2223694350, 215816449...",35,"{2134077845, 2402421397, 2235672854, 217199132..."
1,8.197726e+06,2012.0,"{1969204096, 2226225926, 2024673639, 199363863...","{2024673639, 279977545, 1970425162, 1689175372...",4,"{1970425162, 1689175372, 2122225613, 2024673639}","{279977545, 1778388943, 2506500849, 343947537,...",10,"{1969204096, 2226225926, 1993638631, 256742932..."
2,8.227037e+06,2003.0,"{2438245003, 2566236814, 2099366676, 186420623...","{2119475842, 2103663625, 3037713418, 276189799...",6,"{2698681252, 2464725413, 2135649512, 251608811...","{2119475842, 2103663625, 3037713418, 276189799...",20,"{2438245003, 1513362301, 2566236814, 209936667..."
3,9.474215e+06,2015.0,"{2741493764, 1134876678, 2589376522, 288503604...","{2102010368, 2147699201, 2402001923, 277873818...",55,"{2102010368, 2147699201, 2402001923, 211700557...","{2613431302, 1578585096, 2047308816, 216607540...",481,"{2741493764, 2589376522, 2885036046, 297136949..."
4,1.373700e+07,2014.0,"{2144558976, 2068504066, 1986642243, 146813190...","{2144558976, 1456934145, 2050002051, 146813190...",8,"{2144558976, 1986642243, 146813190, 2304798023...","{1456934145, 2050002051, 2672525068, 279086184...",8,"{2068504066, 1526526887, 1529129866, 207583934..."
...,...,...,...,...,...,...,...,...,...
3838,3.174124e+09,2004.0,"{2706053123, 2250669861, 2568589385, 222924385...","{2939265617, 2687883010, 2424699715, 2100866894}",1,{2100866894},"{2939265617, 2687883010, 2424699715}",6,"{2706053123, 2250669861, 2568589385, 222924385..."
3839,3.174844e+09,2014.0,"{2636262617, 550125002, 2295148299, 2265510894...","{2954067065, 2311908582, 2005715177, 550125002...",3,"{2174600848, 2636262617, 550125002}","{2311908582, 2005715177, 2124843274, 250398225...",3,"{1494968409, 2295148299, 2265510894}"
3840,3.175436e+09,2015.0,"{1805786912, 2999619457, 2658197410, 257933920...","{2130470407, 2395301650, 1455333013, 231293572...",1,{2121913688},"{2240552385, 2130470407, 2333910471, 252033530...",14,"{1805786912, 2999619457, 2658197410, 196858720..."
3841,3.176126e+09,2004.0,"{1986848736, 2166182598, 2098417261, 264898482...","{2509432581, 2132746631, 2105761291, 204507009...",4,"{1986848736, 2477984924, 2648984820, 2166182598}","{2780665794, 2117749316, 2509432581, 213274663...",1,{2098417261}


In [17]:
# Let us first merge df3 and df4

df_A = df3.merge(df4, on=['MAGAID','RetractionYear'])

# Let us also create three flags checking whether current collaborator is retained, gained, or lost

df_A['CollabAIDinRetained'] = df_A.apply(lambda row: row['MAGCollabAID'] in row['CollabAIDRetainedW5'], 
                                          axis=1)

df_A['CollabAIDinGained'] = df_A.apply(lambda row: row['MAGCollabAID'] in row['CollabAIDGainedW5'], 
                                          axis=1)

df_A['CollabAIDinLost'] = df_A.apply(lambda row: row['MAGCollabAID'] in row['CollabAIDLostW5'], 
                                          axis=1)

df_A.head(3)


Unnamed: 0,MAGAID,MAGCollabAID,RetractionYear,MAGCollaborationYear,ScientistType,CollabGenderizeGender,CollabGenderizeConfidence,CollabMAGFirstPubYear,CollabMAGCumPapersYearAtRetraction,CollabMAGCumPapersAtRetraction,...,post5,pre,NumRetentionW5,CollabAIDRetainedW5,CollabAIDLostW5,NumNewCollaboratorsW5,CollabAIDGainedW5,CollabAIDinRetained,CollabAIDinGained,CollabAIDinLost
0,2105038000.0,2004120834,1994.0,1983.0,retracted,male,0.99,1967.0,1994.0,104.0,...,"{2024377920, 2111173543, 2124401064, 213957751...","{2935888513, 2004120834, 3069320578, 231952499...",4,"{2124401064, 2139577513, 2308282282, 2276877851}","{2004120834, 2787959316, 2464403478, 263399605...",4,"{2024377920, 2432858298, 2317413108, 2111173543}",False,False,True
1,2105038000.0,2124401064,1994.0,1983.0,retracted,male,0.74,1964.0,1994.0,78.0,...,"{2024377920, 2111173543, 2124401064, 213957751...","{2935888513, 2004120834, 3069320578, 231952499...",4,"{2124401064, 2139577513, 2308282282, 2276877851}","{2004120834, 2787959316, 2464403478, 263399605...",4,"{2024377920, 2432858298, 2317413108, 2111173543}",True,False,False
2,2105038000.0,2486043001,1994.0,1983.0,retracted,male,0.6,1971.0,1983.0,10.0,...,"{2024377920, 2111173543, 2124401064, 213957751...","{2935888513, 2004120834, 3069320578, 231952499...",4,"{2124401064, 2139577513, 2308282282, 2276877851}","{2004120834, 2787959316, 2464403478, 263399605...",4,"{2024377920, 2432858298, 2317413108, 2111173543}",False,False,True


In [18]:
# Sanity check: every row should carry exactly one of the three flags

df_A[['CollabAIDinRetained','CollabAIDinGained','CollabAIDinLost']].value_counts()

CollabAIDinRetained  CollabAIDinGained  CollabAIDinLost
False                False              True               191468
                     True               False              154178
True                 False              False              118545
Name: count, dtype: int64

In [19]:
df_A.columns, df_A.shape

(Index(['MAGAID', 'MAGCollabAID', 'RetractionYear', 'MAGCollaborationYear',
        'ScientistType', 'CollabGenderizeGender', 'CollabGenderizeConfidence',
        'CollabMAGFirstPubYear', 'CollabMAGCumPapersYearAtRetraction',
        'CollabMAGCumPapersAtRetraction',
        'CollabMAGCumCitationsYearAtRetraction',
        'CollabMAGCumCitationsAtRetraction',
        'CollabMAGCumCollaboratorsYearAtRetraction',
        'CollabMAGCumCollaboratorsAtRetraction',
        'CollabMAGCumPapersYearAtCollaboration',
        'CollabMAGCumPapersAtCollaboration',
        'CollabMAGCumCitationsYearAtCollaboration',
        'CollabMAGCumCitationsAtCollaboration',
        'CollabMAGCumCollaboratorsYearAtCollaboration',
        'CollabMAGCumCollaboratorsAtCollaboration',
        'ReasonPropagatedMajorityOfMajority', 'CollabAcademicAgeAtRetraction',
        'CollabAcademicAgeAtCollaboration', 'PrePostFlag5', 'post5', 'pre',
        'NumRetentionW5', 'CollabAIDRetainedW5', 'CollabAIDLostW5',
        'Nu

## DANGER ZONE!

This code removes collaborators that have academic age > 70 at the time of collaboration. 

In [20]:
df_A[df_A.CollabAcademicAgeAtCollaboration.gt(70) & df_A.ScientistType.eq('retracted')].MAGAID.nunique()

312

In [21]:
df_A = df_A[df_A.CollabAcademicAgeAtCollaboration.le(70)]
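
One side effect worth keeping in mind: a filter of the form `.le(70)` also drops rows whose age is missing, because comparisons against NaN evaluate to False. The toy series below (illustrative values only) sketches this behaviour. Whether any ages are actually missing in `df_A` is not checked here.

```python
import numpy as np
import pandas as pd

# Toy ages: two within range, one above the cutoff, one missing.
ages = pd.Series([5.0, 70.0, 71.0, np.nan])

# Same comparison as the filter above; NaN compares as False.
keep = ages.le(70)
print(keep.tolist())
```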

### A1: Collaborators retained: retracted vs. matched

In [22]:
# Let us now filter df_A so that only post-retraction collaborations (within 5 years) remain

df_A1_post = df_A[df_A['PrePostFlag5']=='post5']

In [23]:
# Now we group by MAGAID, MAGCollabAID, and RetractionYear
# and take the earliest post-retraction collaboration year in each group

df_A1_firstcollabs = df_A1_post.groupby(['MAGAID','MAGCollabAID','RetractionYear'])['MAGCollaborationYear']\
                        .min().reset_index()\
                        .rename(columns={'MAGCollaborationYear':'FirstPostRetractionMAGCollaborationYear'})


# Now we shall merge the new column with A1

df_A1_w_firstcollabs = df_A1_post.merge(df_A1_firstcollabs,
                                   on=['MAGAID','MAGCollabAID','RetractionYear'])

df_A1_w_firstcollabs.shape

(206342, 35)

In [24]:
# Sanity checks

df_A1_w_firstcollabs.sort_values(by=['MAGAID','MAGCollabAID','MAGCollaborationYear'])\
            [['MAGAID','MAGCollabAID','MAGCollaborationYear','FirstPostRetractionMAGCollaborationYear']].head(30)

Unnamed: 0,MAGAID,MAGCollabAID,MAGCollaborationYear,FirstPostRetractionMAGCollaborationYear
154838,5440459.0,97975655,2011.0,2011.0
154858,5440459.0,165967445,2012.0,2012.0
154800,5440459.0,238488652,2014.0,2014.0
154832,5440459.0,293709409,2012.0,2012.0
154833,5440459.0,298713076,2012.0,2012.0
154835,5440459.0,324850990,2011.0,2011.0
154802,5440459.0,698114488,2011.0,2011.0
154803,5440459.0,698114488,2012.0,2011.0
154804,5440459.0,698114488,2013.0,2011.0
154801,5440459.0,698114488,2014.0,2011.0


In [25]:
# Now let us only extract rows where collaboration year is the first collaboration year

df_A1_w_firstcollabs_only = df_A1_w_firstcollabs[df_A1_w_firstcollabs.MAGCollaborationYear == \
                                                df_A1_w_firstcollabs.FirstPostRetractionMAGCollaborationYear]

df_A1_w_firstcollabs_only.shape

(155136, 35)
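
The group-then-merge-then-filter steps above could also be written in one pass with `groupby(...).transform("min")`. The sketch below shows the idea on a toy frame with invented IDs and years. It is an alternative formulation, not the code used in this notebook.

```python
import pandas as pd

# Toy post-retraction collaborations; IDs and years are illustrative.
toy = pd.DataFrame({
    "MAGAID": [1, 1, 1, 1],
    "MAGCollabAID": [10, 10, 10, 20],
    "RetractionYear": [2010, 2010, 2010, 2010],
    "MAGCollaborationYear": [2012, 2011, 2013, 2014],
})

keys = ["MAGAID", "MAGCollabAID", "RetractionYear"]

# transform("min") broadcasts each group's earliest year back onto its rows.
first_year = toy.groupby(keys)["MAGCollaborationYear"].transform("min")

# Keep only the rows that correspond to the first post-retraction collaboration.
first_only = toy[toy["MAGCollaborationYear"] == first_year]
print(first_only)
```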

In [26]:
df_A1_w_firstcollabs_only.columns

Index(['MAGAID', 'MAGCollabAID', 'RetractionYear', 'MAGCollaborationYear',
       'ScientistType', 'CollabGenderizeGender', 'CollabGenderizeConfidence',
       'CollabMAGFirstPubYear', 'CollabMAGCumPapersYearAtRetraction',
       'CollabMAGCumPapersAtRetraction',
       'CollabMAGCumCitationsYearAtRetraction',
       'CollabMAGCumCitationsAtRetraction',
       'CollabMAGCumCollaboratorsYearAtRetraction',
       'CollabMAGCumCollaboratorsAtRetraction',
       'CollabMAGCumPapersYearAtCollaboration',
       'CollabMAGCumPapersAtCollaboration',
       'CollabMAGCumCitationsYearAtCollaboration',
       'CollabMAGCumCitationsAtCollaboration',
       'CollabMAGCumCollaboratorsYearAtCollaboration',
       'CollabMAGCumCollaboratorsAtCollaboration',
       'ReasonPropagatedMajorityOfMajority', 'CollabAcademicAgeAtRetraction',
       'CollabAcademicAgeAtCollaboration', 'PrePostFlag5', 'post5', 'pre',
       'NumRetentionW5', 'CollabAIDRetainedW5', 'CollabAIDLostW5',
       'NumNewCollaboratorsW

In [27]:
def create_stratified_dfs_retention(dfi):
    
    # This function creates the 6 dataframes used in the retention analysis:
    # 3 for the retracted (treatment) scientists and 3 for their matched (control) scientists.
    # Within each group, rows are split by retraction reason (misconduct, plagiarism, mistake).
    # NOTE: the junior/midcareer/senior variable names below hold these three reason groups.
    df_ids = pd.read_csv(INDIR_MATCHING+"/RWMAG_rematched_control_augmented_rematching_30perc.csv",
                    usecols=['MAGAID','MatchMAGAID', 'RetractionYear']).drop_duplicates()
    
    rel_cols = ['MAGAID', 'ScientistType','MAGCollabAID', 'RetractionYear', 'MAGCollaborationYear',
               'CollabMAGCumPapersAtCollaboration', 'CollabMAGCumCitationsAtCollaboration',
               'CollabMAGCumCollaboratorsAtCollaboration', 'ReasonPropagatedMajorityOfMajority',
               'CollabAcademicAgeAtCollaboration', 'CollabAIDinRetained', 'NumRetentionW5']
    
    # Only extracting relevant cols
    dfi = dfi[rel_cols].drop_duplicates()
    
    # Only extract those collaborators that were retained
    dfi = dfi[dfi['CollabAIDinRetained']]
    
    # Dividing into retracted and matched
    df_retracted = dfi[dfi.ScientistType == 'retracted']
    df_nonretracted = dfi[dfi.ScientistType == 'matched']\
                        .rename(columns={'MAGAID':'MatchMAGAID'})\
                        .merge(df_ids, on=['MatchMAGAID','RetractionYear'])
    
    # We also need to make sure that the retracted scientists have matches with non-zero collaborators
    df_retracted = df_retracted[df_retracted.MAGAID.isin(df_nonretracted.MAGAID.unique())]
    
    # We need to make sure that the matches of those who retained 0 collaborators are removed
    df_nonretracted = df_nonretracted[df_nonretracted.MAGAID.isin(df_retracted.MAGAID.unique())]
    
    # Splitting retracted scientists by retraction reason
    df_retracted_junior = df_retracted[df_retracted.ReasonPropagatedMajorityOfMajority=='misconduct']
    df_retracted_midcareer = df_retracted[df_retracted.ReasonPropagatedMajorityOfMajority=='plagiarism']
    df_retracted_senior = df_retracted[df_retracted.ReasonPropagatedMajorityOfMajority=='mistake']
    # and matched
    df_nonretracted_junior = df_nonretracted[df_nonretracted.ReasonPropagatedMajorityOfMajority=='misconduct']
    df_nonretracted_midcareer = df_nonretracted[df_nonretracted.ReasonPropagatedMajorityOfMajority=='plagiarism']
    df_nonretracted_senior = df_nonretracted[df_nonretracted.ReasonPropagatedMajorityOfMajority=='mistake']
    
    return df_retracted_junior, df_retracted_midcareer, df_retracted_senior,df_nonretracted_junior, df_nonretracted_midcareer, df_nonretracted_senior
    

In [28]:
df_rj, df_rm, df_rs, df_nrj, df_nrm, df_nrs = create_stratified_dfs_retention(df_A1_w_firstcollabs_only)

In [29]:
df_rj.MAGAID.nunique(), df_rm.MAGAID.nunique(), df_rs.MAGAID.nunique(), df_nrj.MAGAID.nunique(), df_nrm.MAGAID.nunique(), df_nrs.MAGAID.nunique()

(256, 286, 341, 256, 286, 341)

In [30]:
# Let us extract the mean dataframes and merge them for different age categories

def get_mean_df_retention(dfr, dfnr):
    mean_dfr = dfr.groupby('MAGAID')[['CollabAcademicAgeAtCollaboration',
                      'CollabMAGCumPapersAtCollaboration',
                      'CollabMAGCumCitationsAtCollaboration',
                      'CollabMAGCumCollaboratorsAtCollaboration']].mean()
    
    mean_dfnr = dfnr.groupby(['MAGAID','MatchMAGAID'])[['CollabAcademicAgeAtCollaboration',
                      'CollabMAGCumPapersAtCollaboration',
                      'CollabMAGCumCitationsAtCollaboration',
                      'CollabMAGCumCollaboratorsAtCollaboration']].mean()\
                    .groupby('MAGAID').mean()\
                    .rename(columns={'CollabMAGCumPapersAtCollaboration':'MatchCollabMAGCumPapersAtCollaboration',
                                    'CollabAcademicAgeAtCollaboration':'MatchCollabAcademicAgeAtCollaboration',
                                    'CollabMAGCumCitationsAtCollaboration': 'MatchCollabMAGCumCitationsAtCollaboration',
                                    'CollabMAGCumCollaboratorsAtCollaboration': 'MatchCollabMAGCumCollaboratorsAtCollaboration'})
    
    return mean_dfr, mean_dfnr

def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), stats.sem(a)
    h = se * stats.t.ppf((1 + confidence) / 2., n-1)
    return m, m-h, m+h

def get_stats(dfr, dfnr, column):
    """
    Compute the mean, median, standard deviation, Welch's t-test p-value, and the 95% CI
    of the mean retracted-minus-matched difference for the given column.
    """
    
    mean_r = dfr[column].mean()
    median_r = dfr[column].median()
    std_r = dfr[column].std()
    
    mean_nr = dfnr['Match'+column].mean()
    median_nr = dfnr['Match'+column].median()
    std_nr = dfnr['Match'+column].std()
    
    _, pval = stats.ttest_ind(dfr[column],dfnr['Match'+column], equal_var=False)
    
    lst_delta = (dfr[column]- dfnr['Match'+column]).tolist()
    delta_mean, conf_lower, conf_upper = mean_confidence_interval(lst_delta, confidence=0.95)
    
    return {column+'_retracted_mean':round(mean_r,2), 
            column+'_retracted_median':round(median_r,2), 
            column+'_retracted_std':round(std_r,2), 
            column+'_nonretracted_mean':round(mean_nr,2), 
            column+'_nonretracted_median':round(median_nr,2), 
            column+'_nonretracted_std':round(std_nr,2), 
            column+'_delta_mean':round(delta_mean,2), 
            column+'_pval_welch':round(pval,3), 
            column+'_CI_95lower':round(conf_lower,2), 
            column+'_CI_95upper':round(conf_upper,2)}
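
`get_stats` reports a Welch p-value, which treats the two columns as independent samples, next to a confidence interval computed on the paired differences. Since both frames are indexed by the retracted scientist's `MAGAID`, a paired test is a natural cross-check. The sketch below is not part of the original analysis: the arrays `retracted` and `matched` are synthetic and exist only to show the two calls side by side.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic per-scientist means, aligned one-to-one (assumption for illustration).
matched = rng.normal(loc=15.0, scale=6.0, size=200)
retracted = matched + rng.normal(loc=-1.0, scale=3.0, size=200)

# Welch test: ignores the pairing, as in get_stats.
_, p_welch = stats.ttest_ind(retracted, matched, equal_var=False)

# Paired test: operates on the per-pair differences.
_, p_paired = stats.ttest_rel(retracted, matched)

# 95% CI of the mean difference, same formula as mean_confidence_interval.
delta = retracted - matched
half_width = stats.sem(delta) * stats.t.ppf(0.975, len(delta) - 1)
ci = (delta.mean() - half_width, delta.mean() + half_width)

print(p_welch, p_paired, ci)
```

The paired p-value and the confidence interval always agree on significance at the 5% level, whereas the Welch p-value can disagree with the interval when the pairs are strongly correlated.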

In [31]:
# Now let us do the comparison

# Let us first get the mean dataframes

mean_dfrj, mean_dfnrj = get_mean_df_retention(df_rj, df_nrj)
mean_dfrm, mean_dfnrm = get_mean_df_retention(df_rm, df_nrm)
mean_dfrs, mean_dfnrs = get_mean_df_retention(df_rs, df_nrs)

mean_dfnrj

MAGAID,MatchCollabAcademicAgeAtCollaboration,MatchCollabMAGCumPapersAtCollaboration,MatchCollabMAGCumCitationsAtCollaboration,MatchCollabMAGCumCollaboratorsAtCollaboration
40749300,12.642857,37.857143,359.000000,88.928571
48740240,11.142857,40.142857,443.714286,126.000000
59171237,10.428571,85.000000,235.142857,154.428571
115663519,7.904762,18.238095,268.428571,56.428571
207280435,13.857143,28.500000,439.642857,61.642857
...,...,...,...,...
3023902287,12.654762,43.226190,855.238095,197.619048
3024473098,9.625000,22.500000,23.000000,76.750000
3052744139,8.812500,110.187500,898.125000,123.250000
3095995937,14.882353,63.882353,395.117647,91.294118
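
The control side is averaged twice: first over the collaborators of each matched scientist, then over the matches of each retracted scientist, so that a match with many collaborators does not dominate the control value. A toy illustration follows; all IDs and values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "MAGAID":      [1, 1, 1, 1],        # one retracted scientist
    "MatchMAGAID": [10, 10, 10, 11],    # two matches; match 10 has three collaborators
    "CollabAcademicAgeAtCollaboration": [2.0, 4.0, 6.0, 20.0],
})
col = "CollabAcademicAgeAtCollaboration"

# Stage 1: mean per (retracted, match) pair. Stage 2: mean over the matches.
two_stage = (toy.groupby(["MAGAID", "MatchMAGAID"])[col].mean()
                .groupby(level="MAGAID").mean())

# Pooling all rows instead weights match 10 three times as heavily as match 11.
pooled = toy.groupby("MAGAID")[col].mean()

print(two_stage.loc[1], pooled.loc[1])  # 12.0 8.0
```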


In [32]:
exp_fields = ['CollabAcademicAgeAtCollaboration',
                      'CollabMAGCumPapersAtCollaboration',
                      'CollabMAGCumCitationsAtCollaboration',
                      'CollabMAGCumCollaboratorsAtCollaboration']

# Now we compute the outcome variables for each of the four experience variables.

lst_dicts_retention = []

for exp_field in exp_fields:
    dicts_retention = {}
    
    dict_stats_j = get_stats(mean_dfrj, mean_dfnrj, exp_field)
    dict_stats_m = get_stats(mean_dfrm, mean_dfnrm, exp_field)
    dict_stats_s = get_stats(mean_dfrs, mean_dfnrs, exp_field)
    
    dicts_retention['Misconduct'] = dict_stats_j
    dicts_retention['Plagiarism'] = dict_stats_m
    dicts_retention['Mistake'] = dict_stats_s
    
    lst_dicts_retention.append(dicts_retention)

In [33]:
pd.DataFrame(lst_dicts_retention[0])

Unnamed: 0,Misconduct,Plagiarism,Mistake
CollabAcademicAgeAtCollaboration_retracted_mean,13.56,12.71,15.07
CollabAcademicAgeAtCollaboration_retracted_median,13.65,12.14,15.29
CollabAcademicAgeAtCollaboration_retracted_std,7.43,7.03,6.55
CollabAcademicAgeAtCollaboration_nonretracted_mean,14.78,15.24,15.71
CollabAcademicAgeAtCollaboration_nonretracted_median,14.42,14.1,15.0
CollabAcademicAgeAtCollaboration_nonretracted_std,6.41,7.11,6.69
CollabAcademicAgeAtCollaboration_delta_mean,-1.23,-2.54,-0.64
CollabAcademicAgeAtCollaboration_pval_welch,0.046,0.0,0.207
CollabAcademicAgeAtCollaboration_CI_95lower,-2.31,-3.62,-1.49
CollabAcademicAgeAtCollaboration_CI_95upper,-0.15,-1.45,0.21


In [34]:
pd.DataFrame(lst_dicts_retention[1])

Unnamed: 0,Misconduct,Plagiarism,Mistake
CollabMAGCumPapersAtCollaboration_retracted_mean,73.6,61.95,75.33
CollabMAGCumPapersAtCollaboration_retracted_median,56.59,44.25,61.0
CollabMAGCumPapersAtCollaboration_retracted_std,100.32,64.51,64.38
CollabMAGCumPapersAtCollaboration_nonretracted_mean,81.77,69.02,80.69
CollabMAGCumPapersAtCollaboration_nonretracted_median,60.94,57.84,61.89
CollabMAGCumPapersAtCollaboration_nonretracted_std,104.12,53.99,71.0
CollabMAGCumPapersAtCollaboration_delta_mean,-8.16,-7.07,-5.36
CollabMAGCumPapersAtCollaboration_pval_welch,0.367,0.156,0.302
CollabMAGCumPapersAtCollaboration_CI_95lower,-25.7,-16.5,-14.74
CollabMAGCumPapersAtCollaboration_CI_95upper,9.37,2.35,4.03


In [35]:
pd.DataFrame(lst_dicts_retention[2])

Unnamed: 0,Misconduct,Plagiarism,Mistake
CollabMAGCumCitationsAtCollaboration_retracted_mean,1480.05,981.9,1778.22
CollabMAGCumCitationsAtCollaboration_retracted_median,818.35,409.83,1087.6
CollabMAGCumCitationsAtCollaboration_retracted_std,1962.39,2121.71,2193.66
CollabMAGCumCitationsAtCollaboration_nonretracted_mean,1790.19,1045.83,1824.58
CollabMAGCumCitationsAtCollaboration_nonretracted_median,902.49,577.98,977.36
CollabMAGCumCitationsAtCollaboration_nonretracted_std,2870.8,1318.45,2794.45
CollabMAGCumCitationsAtCollaboration_delta_mean,-310.14,-63.93,-46.35
CollabMAGCumCitationsAtCollaboration_pval_welch,0.154,0.665,0.81
CollabMAGCumCitationsAtCollaboration_CI_95lower,-701.97,-337.87,-394.47
CollabMAGCumCitationsAtCollaboration_CI_95upper,81.68,210.01,301.76


In [36]:
pd.DataFrame(lst_dicts_retention[3])

Unnamed: 0,Misconduct,Plagiarism,Mistake
CollabMAGCumCollaboratorsAtCollaboration_retracted_mean,181.99,153.51,207.92
CollabMAGCumCollaboratorsAtCollaboration_retracted_median,131.15,84.01,133.06
CollabMAGCumCollaboratorsAtCollaboration_retracted_std,219.75,186.01,279.89
CollabMAGCumCollaboratorsAtCollaboration_nonretracted_mean,218.3,161.11,197.27
CollabMAGCumCollaboratorsAtCollaboration_nonretracted_median,147.75,109.83,125.79
CollabMAGCumCollaboratorsAtCollaboration_nonretracted_std,393.01,179.8,221.37
CollabMAGCumCollaboratorsAtCollaboration_delta_mean,-36.31,-7.6,10.66
CollabMAGCumCollaboratorsAtCollaboration_pval_welch,0.198,0.62,0.581
CollabMAGCumCollaboratorsAtCollaboration_CI_95lower,-92.33,-33.9,-24.96
CollabMAGCumCollaboratorsAtCollaboration_CI_95upper,19.71,18.71,46.28


### A2: Collaborators gained: retracted vs. matched

In [37]:
# Keep only the post-retraction collaborations from df_A (drop all pre-retraction rows)

df_A2_post = df_A[df_A['PrePostFlag5']=='post5']

In [38]:
# Group by MAGAID, MAGCollabAID, and RetractionYear,
# then take the earliest post-retraction collaboration year for each pair

df_A2_firstcollabs = df_A2_post.groupby(['MAGAID','MAGCollabAID','RetractionYear'])['MAGCollaborationYear']\
                        .min().reset_index()\
                        .rename(columns={'MAGCollaborationYear':'FirstPostRetractionMAGCollaborationYear'})


# Now we merge the new column back into the post-retraction rows (A2)

df_A2_w_firstcollabs = df_A2_post.merge(df_A2_firstcollabs,
                                   on=['MAGAID','MAGCollabAID','RetractionYear'])

df_A2_w_firstcollabs.shape

(206342, 35)

In [39]:
# Sanity check

df_A2_w_firstcollabs.sort_values(by=['MAGAID','MAGCollabAID','MAGCollaborationYear'])\
            [['MAGAID','MAGCollabAID','MAGCollaborationYear','FirstPostRetractionMAGCollaborationYear']].head(30)

Unnamed: 0,MAGAID,MAGCollabAID,MAGCollaborationYear,FirstPostRetractionMAGCollaborationYear
154838,5440459.0,97975655,2011.0,2011.0
154858,5440459.0,165967445,2012.0,2012.0
154800,5440459.0,238488652,2014.0,2014.0
154832,5440459.0,293709409,2012.0,2012.0
154833,5440459.0,298713076,2012.0,2012.0
154835,5440459.0,324850990,2011.0,2011.0
154802,5440459.0,698114488,2011.0,2011.0
154803,5440459.0,698114488,2012.0,2011.0
154804,5440459.0,698114488,2013.0,2011.0
154801,5440459.0,698114488,2014.0,2011.0


In [40]:
# Now let us only extract rows where collaboration year is the first collaboration year

df_A2_w_firstcollabs_only = df_A2_w_firstcollabs[df_A2_w_firstcollabs.MAGCollaborationYear == \
                                                df_A2_w_firstcollabs.FirstPostRetractionMAGCollaborationYear]

df_A2_w_firstcollabs_only.shape

(155136, 35)
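
The `groupby(...).min()` call followed by a merge and an equality filter keeps, for each scientist-collaborator pair, only the rows from their first post-retraction collaboration year. An equivalent, more compact formulation uses `groupby(...).transform("min")`. The sketch below demonstrates it on a few rows modeled on the sanity-check output; the `RetractionYear` value is invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "MAGAID":               [5440459] * 5,
    "MAGCollabAID":         [97975655, 698114488, 698114488, 698114488, 698114488],
    "RetractionYear":       [2010] * 5,   # hypothetical value
    "MAGCollaborationYear": [2011, 2011, 2012, 2013, 2014],
})

keys = ["MAGAID", "MAGCollabAID", "RetractionYear"]

# Broadcast each group's earliest year back onto its rows, then filter.
first_year = toy.groupby(keys)["MAGCollaborationYear"].transform("min")
first_only = toy[toy["MAGCollaborationYear"] == first_year]

print(first_only)
```

Only one row per pair survives here, because each pair has a single row in its first collaboration year.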

In [41]:
df_A2_w_firstcollabs_only.columns

Index(['MAGAID', 'MAGCollabAID', 'RetractionYear', 'MAGCollaborationYear',
       'ScientistType', 'CollabGenderizeGender', 'CollabGenderizeConfidence',
       'CollabMAGFirstPubYear', 'CollabMAGCumPapersYearAtRetraction',
       'CollabMAGCumPapersAtRetraction',
       'CollabMAGCumCitationsYearAtRetraction',
       'CollabMAGCumCitationsAtRetraction',
       'CollabMAGCumCollaboratorsYearAtRetraction',
       'CollabMAGCumCollaboratorsAtRetraction',
       'CollabMAGCumPapersYearAtCollaboration',
       'CollabMAGCumPapersAtCollaboration',
       'CollabMAGCumCitationsYearAtCollaboration',
       'CollabMAGCumCitationsAtCollaboration',
       'CollabMAGCumCollaboratorsYearAtCollaboration',
       'CollabMAGCumCollaboratorsAtCollaboration',
       'ReasonPropagatedMajorityOfMajority', 'CollabAcademicAgeAtRetraction',
       'CollabAcademicAgeAtCollaboration', 'PrePostFlag5', 'post5', 'pre',
       'NumRetentionW5', 'CollabAIDRetainedW5', 'CollabAIDLostW5',
       'NumNewCollaboratorsW

In [42]:
def create_stratified_dfs_gain(dfi):
    
    # This function creates the 6 dataframes used in the analysis of gained collaborators:
    # 3 hold the relevant columns for the retracted (treatment) scientists,
    # the other 3 hold the same columns for their matched controls.
    # Both groups are stratified by retraction reason (misconduct, plagiarism, mistake).
    df_ids = pd.read_csv(INDIR_MATCHING+"/RWMAG_rematched_control_augmented_rematching_30perc.csv",
                    usecols=['MAGAID','MatchMAGAID', 'RetractionYear']).drop_duplicates()
    
    rel_cols = ['MAGAID', 'ScientistType','MAGCollabAID', 'RetractionYear', 'MAGCollaborationYear',
               'CollabMAGCumPapersAtCollaboration', 'CollabMAGCumCitationsAtCollaboration',
               'CollabMAGCumCollaboratorsAtCollaboration', 'ReasonPropagatedMajorityOfMajority',
               'CollabAcademicAgeAtCollaboration', 'CollabAIDinGained', 'NumNewCollaboratorsW5']
    
    # Only extracting relevant cols
    dfi = dfi[rel_cols].drop_duplicates()
    
    # Only extract those collaborators that were gained
    dfi = dfi[dfi['CollabAIDinGained']]
    
    # Dividing into retracted and matched
    df_retracted = dfi[dfi.ScientistType == 'retracted']
    df_nonretracted = dfi[dfi.ScientistType == 'matched']\
                        .rename(columns={'MAGAID':'MatchMAGAID'})\
                        .merge(df_ids, on=['MatchMAGAID','RetractionYear'])
    
    # We also need to make sure that the retracted scientists have matches with a nonzero number of gained collaborators
    df_retracted = df_retracted[df_retracted.MAGAID.isin(df_nonretracted.MAGAID.unique())]
    
    # We need to make sure that the matches of those who gained 0 collaborators are removed
    df_nonretracted = df_nonretracted[df_nonretracted.MAGAID.isin(df_retracted.MAGAID.unique())]
    
    # Dividing by retraction reason for retracted
    # (the junior/midcareer/senior variable names hold misconduct/plagiarism/mistake)
    df_retracted_junior = df_retracted[df_retracted.ReasonPropagatedMajorityOfMajority=='misconduct']
    df_retracted_midcareer = df_retracted[df_retracted.ReasonPropagatedMajorityOfMajority=='plagiarism']
    df_retracted_senior = df_retracted[df_retracted.ReasonPropagatedMajorityOfMajority=='mistake']
    # and matched
    df_nonretracted_junior = df_nonretracted[df_nonretracted.ReasonPropagatedMajorityOfMajority=='misconduct']
    df_nonretracted_midcareer = df_nonretracted[df_nonretracted.ReasonPropagatedMajorityOfMajority=='plagiarism']
    df_nonretracted_senior = df_nonretracted[df_nonretracted.ReasonPropagatedMajorityOfMajority=='mistake']
    
    return df_retracted_junior, df_retracted_midcareer, df_retracted_senior,df_nonretracted_junior, df_nonretracted_midcareer, df_nonretracted_senior
    

In [43]:
df_rj, df_rm, df_rs, df_nrj, df_nrm, df_nrs = create_stratified_dfs_gain(df_A2_w_firstcollabs_only)

In [44]:
df_rj.MAGAID.nunique(), df_rm.MAGAID.nunique(), df_rs.MAGAID.nunique(), df_nrj.MAGAID.nunique(), df_nrm.MAGAID.nunique(), df_nrs.MAGAID.nunique()

(260, 292, 368, 260, 292, 368)

In [45]:
# Let us extract the per-scientist mean dataframes for each retraction reason

def get_mean_df_gain(dfr, dfnr):
    mean_dfr = dfr.groupby('MAGAID')[['CollabAcademicAgeAtCollaboration',
                      'CollabMAGCumPapersAtCollaboration',
                      'CollabMAGCumCitationsAtCollaboration',
                      'CollabMAGCumCollaboratorsAtCollaboration']].mean()
    
    mean_dfnr = dfnr.groupby(['MAGAID','MatchMAGAID'])[['CollabAcademicAgeAtCollaboration',
                      'CollabMAGCumPapersAtCollaboration',
                      'CollabMAGCumCitationsAtCollaboration',
                      'CollabMAGCumCollaboratorsAtCollaboration']].mean()\
                    .groupby('MAGAID').mean()\
                    .rename(columns={'CollabMAGCumPapersAtCollaboration':'MatchCollabMAGCumPapersAtCollaboration',
                                    'CollabAcademicAgeAtCollaboration':'MatchCollabAcademicAgeAtCollaboration',
                                    'CollabMAGCumCitationsAtCollaboration': 'MatchCollabMAGCumCitationsAtCollaboration',
                                    'CollabMAGCumCollaboratorsAtCollaboration': 'MatchCollabMAGCumCollaboratorsAtCollaboration'})
    
    return mean_dfr, mean_dfnr

def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), stats.sem(a)
    h = se * stats.t.ppf((1 + confidence) / 2., n-1)
    return m, m-h, m+h

def get_stats(dfr, dfnr, column):
    """
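    Compute the mean, median, and standard deviation of `column` for both groups,
    the Welch t-test p-value (unpaired), and the 95% CI of the mean paired difference
    (retracted minus matched, aligned on the MAGAID index).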
    """
    
    mean_r = dfr[column].mean()
    median_r = dfr[column].median()
    std_r = dfr[column].std()
    
    mean_nr = dfnr['Match'+column].mean()
    median_nr = dfnr['Match'+column].median()
    std_nr = dfnr['Match'+column].std()
    
    _, pval = stats.ttest_ind(dfr[column],dfnr['Match'+column], equal_var=False)
    
    lst_delta = (dfr[column]- dfnr['Match'+column]).tolist()
    delta_mean, conf_lower, conf_upper = mean_confidence_interval(lst_delta, confidence=0.95)
    
    return {column+'_retracted_mean':round(mean_r,2), 
            column+'_retracted_median':round(median_r,2), 
            column+'_retracted_std':round(std_r,2), 
            column+'_nonretracted_mean':round(mean_nr,2), 
            column+'_nonretracted_median':round(median_nr,2), 
            column+'_nonretracted_std':round(std_nr,2), 
            column+'_delta_mean':round(delta_mean,2), 
            column+'_pval_welch':round(pval,3), 
            column+'_CI_95lower':round(conf_lower,2), 
            column+'_CI_95upper':round(conf_upper,2)}

In [46]:
# Now let us do the comparison

# Let us first get the mean dataframes

mean_dfrj, mean_dfnrj = get_mean_df_gain(df_rj, df_nrj)
mean_dfrm, mean_dfnrm = get_mean_df_gain(df_rm, df_nrm)
mean_dfrs, mean_dfnrs = get_mean_df_gain(df_rs, df_nrs)

mean_dfnrj

MAGAID,MatchCollabAcademicAgeAtCollaboration,MatchCollabMAGCumPapersAtCollaboration,MatchCollabMAGCumCitationsAtCollaboration,MatchCollabMAGCumCollaboratorsAtCollaboration
40749300,8.565217,39.456522,601.043478,87.195652
48740240,9.166667,34.000000,624.500000,71.333333
59171237,5.952381,9.333333,55.571429,27.190476
62104001,9.000000,4.000000,22.000000,10.000000
115663519,3.840000,13.906667,183.520000,52.933333
...,...,...,...,...
3024473098,2.764706,5.117647,1.823529,19.588235
3052744139,6.987405,53.289624,1168.337350,245.023625
3095995937,9.129032,51.193548,492.774194,109.580645
3166925194,13.375000,29.375000,375.750000,90.500000


In [47]:
exp_fields = ['CollabAcademicAgeAtCollaboration',
                      'CollabMAGCumPapersAtCollaboration',
                      'CollabMAGCumCitationsAtCollaboration',
                      'CollabMAGCumCollaboratorsAtCollaboration']

# Now we compute the outcome variables for each of the four experience variables.

lst_dicts_gain = []

for exp_field in exp_fields:
    dicts_gain = {}
    
    dict_stats_j = get_stats(mean_dfrj, mean_dfnrj, exp_field)
    dict_stats_m = get_stats(mean_dfrm, mean_dfnrm, exp_field)
    dict_stats_s = get_stats(mean_dfrs, mean_dfnrs, exp_field)
    
    dicts_gain['Misconduct'] = dict_stats_j
    dicts_gain['Plagiarism'] = dict_stats_m
    dicts_gain['Mistake'] = dict_stats_s
    
    lst_dicts_gain.append(dicts_gain)

In [48]:
pd.DataFrame(lst_dicts_gain[0])

Unnamed: 0,Misconduct,Plagiarism,Mistake
CollabAcademicAgeAtCollaboration_retracted_mean,7.98,6.68,8.61
CollabAcademicAgeAtCollaboration_retracted_median,7.55,6.33,8.61
CollabAcademicAgeAtCollaboration_retracted_std,6.07,4.83,4.83
CollabAcademicAgeAtCollaboration_nonretracted_mean,8.1,7.74,8.14
CollabAcademicAgeAtCollaboration_nonretracted_median,7.92,7.56,7.84
CollabAcademicAgeAtCollaboration_nonretracted_std,4.48,4.75,4.53
CollabAcademicAgeAtCollaboration_delta_mean,-0.12,-1.06,0.47
CollabAcademicAgeAtCollaboration_pval_welch,0.792,0.008,0.172
CollabAcademicAgeAtCollaboration_CI_95lower,-1.0,-1.78,-0.14
CollabAcademicAgeAtCollaboration_CI_95upper,0.76,-0.34,1.09


In [49]:
pd.DataFrame(lst_dicts_gain[1])

Unnamed: 0,Misconduct,Plagiarism,Mistake
CollabMAGCumPapersAtCollaboration_retracted_mean,45.01,32.43,41.39
CollabMAGCumPapersAtCollaboration_retracted_median,28.36,22.32,34.3
CollabMAGCumPapersAtCollaboration_retracted_std,62.33,34.17,42.97
CollabMAGCumPapersAtCollaboration_nonretracted_mean,34.42,33.89,37.49
CollabMAGCumPapersAtCollaboration_nonretracted_median,31.27,28.24,26.77
CollabMAGCumPapersAtCollaboration_nonretracted_std,25.45,30.07,39.04
CollabMAGCumPapersAtCollaboration_delta_mean,10.59,-1.46,3.9
CollabMAGCumPapersAtCollaboration_pval_welch,0.012,0.584,0.198
CollabMAGCumPapersAtCollaboration_CI_95lower,2.33,-6.31,-1.86
CollabMAGCumPapersAtCollaboration_CI_95upper,18.85,3.39,9.65


In [50]:
pd.DataFrame(lst_dicts_gain[2])

Unnamed: 0,Misconduct,Plagiarism,Mistake
CollabMAGCumCitationsAtCollaboration_retracted_mean,1105.93,603.46,1075.94
CollabMAGCumCitationsAtCollaboration_retracted_median,501.37,271.05,624.41
CollabMAGCumCitationsAtCollaboration_retracted_std,1928.01,1149.47,1575.77
CollabMAGCumCitationsAtCollaboration_nonretracted_mean,702.35,625.35,946.34
CollabMAGCumCitationsAtCollaboration_nonretracted_median,473.39,299.33,429.64
CollabMAGCumCitationsAtCollaboration_nonretracted_std,856.96,937.57,1554.61
CollabMAGCumCitationsAtCollaboration_delta_mean,403.58,-21.9,129.6
CollabMAGCumCitationsAtCollaboration_pval_welch,0.002,0.801,0.262
CollabMAGCumCitationsAtCollaboration_CI_95lower,160.25,-182.45,-94.11
CollabMAGCumCitationsAtCollaboration_CI_95upper,646.9,138.66,353.31


In [51]:
pd.DataFrame(lst_dicts_gain[3])

Unnamed: 0,Misconduct,Plagiarism,Mistake
CollabMAGCumCollaboratorsAtCollaboration_retracted_mean,156.42,103.74,206.15
CollabMAGCumCollaboratorsAtCollaboration_retracted_median,79.56,50.84,88.21
CollabMAGCumCollaboratorsAtCollaboration_retracted_std,319.8,149.55,1031.16
CollabMAGCumCollaboratorsAtCollaboration_nonretracted_mean,110.78,102.41,123.36
CollabMAGCumCollaboratorsAtCollaboration_nonretracted_median,73.73,58.68,70.96
CollabMAGCumCollaboratorsAtCollaboration_nonretracted_std,132.79,147.47,169.27
CollabMAGCumCollaboratorsAtCollaboration_delta_mean,45.63,1.33,82.79
CollabMAGCumCollaboratorsAtCollaboration_pval_welch,0.034,0.914,0.129
CollabMAGCumCollaboratorsAtCollaboration_CI_95lower,3.21,-20.02,-24.66
CollabMAGCumCollaboratorsAtCollaboration_CI_95upper,88.06,22.67,190.24


In [52]:
def create_latex_for_filling(dicto, col):
    
    def create_string(metric):
        string = ""
        if metric == 'pval_welch':
            string = "& " + \
                str(dicto.get('Misconduct').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Misconduct').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Plagiarism').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Plagiarism').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Mistake').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Mistake').get(col+"_"+metric)) + \
                "\\\\ \n"
        else:
            string = "& " + \
                    str(dicto.get('Misconduct').get(col+"_retracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Misconduct').get(col+"_nonretracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Plagiarism').get(col+"_retracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Plagiarism').get(col+"_nonretracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Mistake').get(col+"_retracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Mistake').get(col+"_nonretracted_"+metric)) + \
                    "\\\\ \n"
        
        
        
        return string
    
    print(create_string("mean"))
    print(create_string("median"))
    print(create_string("std"))
    print(create_string("pval_welch"))
    

for i in range(len(lst_dicts_retention)):
    dicto = lst_dicts_retention[i]
    col = exp_fields[i]
    print(col)
    create_latex_for_filling(dicto, col)
    
    
# pd.DataFrame(lst_dicts_retention[0])

CollabAcademicAgeAtCollaboration
& 13.56 & 14.78 & 12.71 & 15.24 & 15.07 & 15.71\\ 

& 13.65 & 14.42 & 12.14 & 14.1 & 15.29 & 15.0\\ 

& 7.43 & 6.41 & 7.03 & 7.11 & 6.55 & 6.69\\ 

& 0.046 & 0.046 & 0.0 & 0.0 & 0.207 & 0.207\\ 

CollabMAGCumPapersAtCollaboration
& 73.6 & 81.77 & 61.95 & 69.02 & 75.33 & 80.69\\ 

& 56.59 & 60.94 & 44.25 & 57.84 & 61.0 & 61.89\\ 

& 100.32 & 104.12 & 64.51 & 53.99 & 64.38 & 71.0\\ 

& 0.367 & 0.367 & 0.156 & 0.156 & 0.302 & 0.302\\ 

CollabMAGCumCitationsAtCollaboration
& 1480.05 & 1790.19 & 981.9 & 1045.83 & 1778.22 & 1824.58\\ 

& 818.35 & 902.49 & 409.83 & 577.98 & 1087.6 & 977.36\\ 

& 1962.39 & 2870.8 & 2121.71 & 1318.45 & 2193.66 & 2794.45\\ 

& 0.154 & 0.154 & 0.665 & 0.665 & 0.81 & 0.81\\ 

CollabMAGCumCollaboratorsAtCollaboration
& 181.99 & 218.3 & 153.51 & 161.11 & 207.92 & 197.27\\ 

& 131.15 & 147.75 & 84.01 & 109.83 & 133.06 & 125.79\\ 

& 219.75 & 393.01 & 186.01 & 179.8 & 279.89 & 221.37\\ 

& 0.198 & 0.198 & 0.62 & 0.62 & 0.581 & 0.581\\ 

In [53]:
for i in range(len(lst_dicts_gain)):
    dicto = lst_dicts_gain[i]
    col = exp_fields[i]
    print(col)
    create_latex_for_filling(dicto, col)

CollabAcademicAgeAtCollaboration
& 7.98 & 8.1 & 6.68 & 7.74 & 8.61 & 8.14\\ 

& 7.55 & 7.92 & 6.33 & 7.56 & 8.61 & 7.84\\ 

& 6.07 & 4.48 & 4.83 & 4.75 & 4.83 & 4.53\\ 

& 0.792 & 0.792 & 0.008 & 0.008 & 0.172 & 0.172\\ 

CollabMAGCumPapersAtCollaboration
& 45.01 & 34.42 & 32.43 & 33.89 & 41.39 & 37.49\\ 

& 28.36 & 31.27 & 22.32 & 28.24 & 34.3 & 26.77\\ 

& 62.33 & 25.45 & 34.17 & 30.07 & 42.97 & 39.04\\ 

& 0.012 & 0.012 & 0.584 & 0.584 & 0.198 & 0.198\\ 

CollabMAGCumCitationsAtCollaboration
& 1105.93 & 702.35 & 603.46 & 625.35 & 1075.94 & 946.34\\ 

& 501.37 & 473.39 & 271.05 & 299.33 & 624.41 & 429.64\\ 

& 1928.01 & 856.96 & 1149.47 & 937.57 & 1575.77 & 1554.61\\ 

& 0.002 & 0.002 & 0.801 & 0.801 & 0.262 & 0.262\\ 

CollabMAGCumCollaboratorsAtCollaboration
& 156.42 & 110.78 & 103.74 & 102.41 & 206.15 & 123.36\\ 

& 79.56 & 73.73 & 50.84 & 58.68 & 88.21 & 70.96\\ 

& 319.8 & 132.79 & 149.55 & 147.47 & 1031.16 & 169.27\\ 

& 0.034 & 0.034 & 0.914 & 0.914 & 0.129 & 0.129\\ 
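
The repeated string concatenation in `create_latex_for_filling` can also be written with `str.join`. The helper `latex_row` below is a hypothetical alternative, not taken from the notebook. It produces the same row layout: for each reason, the retracted value followed by the matched value, with the single p-value repeated in both columns.

```python
def latex_row(dicto, col, metric, reasons=("Misconduct", "Plagiarism", "Mistake")):
    """Build one LaTeX table row for `metric` across the three reasons."""
    cells = []
    for reason in reasons:
        stats_for_reason = dicto[reason]
        if metric == "pval_welch":
            # One p-value per reason, repeated to fill both columns.
            value = stats_for_reason[f"{col}_{metric}"]
            cells += [value, value]
        else:
            cells += [stats_for_reason[f"{col}_retracted_{metric}"],
                      stats_for_reason[f"{col}_nonretracted_{metric}"]]
    return "& " + " & ".join(str(c) for c in cells) + "\\\\ \n"


# Means copied from the retained-collaborators academic-age table above.
col = "CollabAcademicAgeAtCollaboration"
example = {
    "Misconduct": {col + "_retracted_mean": 13.56, col + "_nonretracted_mean": 14.78},
    "Plagiarism": {col + "_retracted_mean": 12.71, col + "_nonretracted_mean": 15.24},
    "Mistake":    {col + "_retracted_mean": 15.07, col + "_nonretracted_mean": 15.71},
}
row = latex_row(example, col, "mean")
print(row)
```

Indexing with `[...]` rather than `.get(...)` raises a `KeyError` on a missing statistic instead of silently printing `None` into the table.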



### A3: Collaborators retained vs lost: retracted vs. matched
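
This section compares retained and lost collaborators with a difference-in-differences (DiD) layout: the retracted-minus-matched gap is computed separately for retained and for lost collaborators, and the two gaps are then subtracted (retained minus lost). A minimal arithmetic sketch with invented group means:

```python
# Hypothetical group means of one experience variable (e.g. academic age).
means = {
    ("retracted", "retained"): 14.0,
    ("matched",   "retained"): 15.0,
    ("retracted", "lost"):     10.0,
    ("matched",   "lost"):     10.5,
}

# First differences: retracted minus matched, within each collaborator type.
gap_retained = means[("retracted", "retained")] - means[("matched", "retained")]
gap_lost = means[("retracted", "lost")] - means[("matched", "lost")]

# Second difference: retained minus lost.
did = gap_retained - gap_lost

print(gap_retained, gap_lost, did)  # -1.0 -0.5 -0.5
```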

In [54]:
# Split df_A into post-retraction and pre-retraction collaborations

df_A3_post = df_A[df_A['PrePostFlag5']=='post5']
df_A3_pre = df_A[df_A['PrePostFlag5']=='pre']

In [55]:
# Group by MAGAID, MAGCollabAID, and RetractionYear,
# then take the earliest post-retraction collaboration year for each pair

df_A3_firstcollabs = df_A3_post.groupby(['MAGAID','MAGCollabAID','RetractionYear'])['MAGCollaborationYear']\
                        .min().reset_index()\
                        .rename(columns={'MAGCollaborationYear':'FirstPostRetractionMAGCollaborationYear'})


# Now we merge the new column back into the post-retraction rows (A3)

df_A3_w_firstcollabs = df_A3_post.merge(df_A3_firstcollabs,
                                   on=['MAGAID','MAGCollabAID','RetractionYear'])

df_A3_w_firstcollabs.shape


(206342, 35)

In [56]:
# Sanity check

df_A3_w_firstcollabs.sort_values(by=['MAGAID','MAGCollabAID','MAGCollaborationYear'])\
            [['MAGAID','MAGCollabAID','MAGCollaborationYear','FirstPostRetractionMAGCollaborationYear']].head(30)

Unnamed: 0,MAGAID,MAGCollabAID,MAGCollaborationYear,FirstPostRetractionMAGCollaborationYear
154838,5440459.0,97975655,2011.0,2011.0
154858,5440459.0,165967445,2012.0,2012.0
154800,5440459.0,238488652,2014.0,2014.0
154832,5440459.0,293709409,2012.0,2012.0
154833,5440459.0,298713076,2012.0,2012.0
154835,5440459.0,324850990,2011.0,2011.0
154802,5440459.0,698114488,2011.0,2011.0
154803,5440459.0,698114488,2012.0,2011.0
154804,5440459.0,698114488,2013.0,2011.0
154801,5440459.0,698114488,2014.0,2011.0


In [57]:
# Now let us only extract rows where collaboration year is the first collaboration year

df_A3_w_firstcollabs_only = df_A3_w_firstcollabs[df_A3_w_firstcollabs.MAGCollaborationYear == \
                                                df_A3_w_firstcollabs.FirstPostRetractionMAGCollaborationYear]

df_A3_w_firstcollabs_only.shape

(155136, 35)

In [58]:
df_A3_w_firstcollabs_only.columns

Index(['MAGAID', 'MAGCollabAID', 'RetractionYear', 'MAGCollaborationYear',
       'ScientistType', 'CollabGenderizeGender', 'CollabGenderizeConfidence',
       'CollabMAGFirstPubYear', 'CollabMAGCumPapersYearAtRetraction',
       'CollabMAGCumPapersAtRetraction',
       'CollabMAGCumCitationsYearAtRetraction',
       'CollabMAGCumCitationsAtRetraction',
       'CollabMAGCumCollaboratorsYearAtRetraction',
       'CollabMAGCumCollaboratorsAtRetraction',
       'CollabMAGCumPapersYearAtCollaboration',
       'CollabMAGCumPapersAtCollaboration',
       'CollabMAGCumCitationsYearAtCollaboration',
       'CollabMAGCumCitationsAtCollaboration',
       'CollabMAGCumCollaboratorsYearAtCollaboration',
       'CollabMAGCumCollaboratorsAtCollaboration',
       'ReasonPropagatedMajorityOfMajority', 'CollabAcademicAgeAtRetraction',
       'CollabAcademicAgeAtCollaboration', 'PrePostFlag5', 'post5', 'pre',
       'NumRetentionW5', 'CollabAIDRetainedW5', 'CollabAIDLostW5',
       'NumNewCollaboratorsW

In [59]:
# Finally, let us concatenate the post- and pre-retraction rows

df_A3_post_pre = pd.concat([df_A3_w_firstcollabs_only,df_A3_pre])

df_A3_post_pre.head()

Unnamed: 0,MAGAID,MAGCollabAID,RetractionYear,MAGCollaborationYear,ScientistType,CollabGenderizeGender,CollabGenderizeConfidence,CollabMAGFirstPubYear,CollabMAGCumPapersYearAtRetraction,CollabMAGCumPapersAtRetraction,...,pre,NumRetentionW5,CollabAIDRetainedW5,CollabAIDLostW5,NumNewCollaboratorsW5,CollabAIDGainedW5,CollabAIDinRetained,CollabAIDinGained,CollabAIDinLost,FirstPostRetractionMAGCollaborationYear
0,2105038000.0,2024377920,1994.0,1999.0,retracted,male,0.99,1992.0,1994.0,3.0,...,"{2935888513, 2004120834, 3069320578, 231952499...",4,"{2124401064, 2139577513, 2308282282, 2276877851}","{2004120834, 2787959316, 2464403478, 263399605...",4,"{2024377920, 2432858298, 2317413108, 2111173543}",False,True,False,1999.0
1,2105038000.0,2111173543,1994.0,1999.0,retracted,male,0.99,1979.0,1994.0,41.0,...,"{2935888513, 2004120834, 3069320578, 231952499...",4,"{2124401064, 2139577513, 2308282282, 2276877851}","{2004120834, 2787959316, 2464403478, 263399605...",4,"{2024377920, 2432858298, 2317413108, 2111173543}",False,True,False,1999.0
2,2105038000.0,2317413108,1994.0,1999.0,retracted,male,0.99,1998.0,,0.0,...,"{2935888513, 2004120834, 3069320578, 231952499...",4,"{2124401064, 2139577513, 2308282282, 2276877851}","{2004120834, 2787959316, 2464403478, 263399605...",4,"{2024377920, 2432858298, 2317413108, 2111173543}",False,True,False,1999.0
3,2105038000.0,2432858298,1994.0,1999.0,retracted,male,0.86,1962.0,1994.0,67.0,...,"{2935888513, 2004120834, 3069320578, 231952499...",4,"{2124401064, 2139577513, 2308282282, 2276877851}","{2004120834, 2787959316, 2464403478, 263399605...",4,"{2024377920, 2432858298, 2317413108, 2111173543}",False,True,False,1999.0
4,2105038000.0,2124401064,1994.0,1996.0,retracted,male,0.74,1964.0,1994.0,78.0,...,"{2935888513, 2004120834, 3069320578, 231952499...",4,"{2124401064, 2139577513, 2308282282, 2276877851}","{2004120834, 2787959316, 2464403478, 263399605...",4,"{2024377920, 2432858298, 2317413108, 2111173543}",True,False,False,1996.0


In [60]:
def create_stratified_dfs_a3(dfi):
    
    # This function creates the 12 dataframes used in the analysis:
    # retracted (treatment) vs. matched (average control),
    # each split into retained and lost collaborators,
    # and further stratified by retraction reason (misconduct, plagiarism, mistake).
    df_ids = pd.read_csv(INDIR_MATCHING+"/RWMAG_rematched_control_augmented_rematching_30perc.csv",
                    usecols=['MAGAID','MatchMAGAID', 'RetractionYear']).drop_duplicates()
    
    rel_cols = ['MAGAID', 'ScientistType','MAGCollabAID', 'RetractionYear',
               'CollabMAGCumPapersAtRetraction', 'CollabMAGCumCitationsAtRetraction',
               'CollabMAGCumCollaboratorsAtRetraction', 'ReasonPropagatedMajorityOfMajority',
               'CollabAcademicAgeAtRetraction', 'CollabAIDinRetained', 'CollabAIDinLost']
    
    # Only extracting relevant cols
    dfi = dfi[rel_cols].drop_duplicates()
    
    # Extract the collaborators that were retained and those that were lost
    dfi_retained = dfi[dfi['CollabAIDinRetained']]
    dfi_lost = dfi[dfi['CollabAIDinLost']]
    
    # Dividing into retracted and matched
    df_retracted_retained = dfi_retained[dfi_retained.ScientistType == 'retracted']
    df_retracted_lost = dfi_lost[dfi_lost.ScientistType == 'retracted']
    
    df_nonretracted_retained = dfi_retained[dfi_retained.ScientistType == 'matched']\
                        .rename(columns={'MAGAID':'MatchMAGAID'})\
                        .merge(df_ids, on=['MatchMAGAID','RetractionYear'])
    
    df_nonretracted_lost = dfi_lost[dfi_lost.ScientistType == 'matched']\
                        .rename(columns={'MAGAID':'MatchMAGAID'})\
                        .merge(df_ids, on=['MatchMAGAID','RetractionYear'])
    
    # We also need to make sure that the four groups (retracted/matched x retained/lost) share the same IDs
    
    set1 = set(df_retracted_retained['MAGAID'].unique())
    set2 = set(df_retracted_lost['MAGAID'].unique())
    set3 = set(df_nonretracted_retained['MAGAID'].unique())
    set4 = set(df_nonretracted_lost['MAGAID'].unique())
    
    magaids_intersection = set1.intersection(set2, set3, set4)
    
    df_retracted_retained = df_retracted_retained[df_retracted_retained.MAGAID.isin(magaids_intersection)]
    df_retracted_lost = df_retracted_lost[df_retracted_lost.MAGAID.isin(magaids_intersection)]
    df_nonretracted_retained = df_nonretracted_retained[df_nonretracted_retained.MAGAID.isin(magaids_intersection)]
    df_nonretracted_lost = df_nonretracted_lost[df_nonretracted_lost.MAGAID.isin(magaids_intersection)]

    
    # Dividing by retraction reason for retracted: retained (_r), then lost (_l)
    # (the j/m/s suffixes hold misconduct/plagiarism/mistake)
    
    dfrj_r = df_retracted_retained[df_retracted_retained.ReasonPropagatedMajorityOfMajority=='misconduct']
    dfrm_r = df_retracted_retained[df_retracted_retained.ReasonPropagatedMajorityOfMajority=='plagiarism']
    dfrs_r = df_retracted_retained[df_retracted_retained.ReasonPropagatedMajorityOfMajority=='mistake']
    
    dfrj_l = df_retracted_lost[df_retracted_lost.ReasonPropagatedMajorityOfMajority=='misconduct']
    dfrm_l = df_retracted_lost[df_retracted_lost.ReasonPropagatedMajorityOfMajority=='plagiarism']
    dfrs_l = df_retracted_lost[df_retracted_lost.ReasonPropagatedMajorityOfMajority=='mistake']
    
    # and matched
    dfnrj_r = df_nonretracted_retained[df_nonretracted_retained.ReasonPropagatedMajorityOfMajority=='misconduct']
    dfnrm_r = df_nonretracted_retained[df_nonretracted_retained.ReasonPropagatedMajorityOfMajority=='plagiarism']
    dfnrs_r = df_nonretracted_retained[df_nonretracted_retained.ReasonPropagatedMajorityOfMajority=='mistake']
    
    dfnrj_l = df_nonretracted_lost[df_nonretracted_lost.ReasonPropagatedMajorityOfMajority=='misconduct']
    dfnrm_l = df_nonretracted_lost[df_nonretracted_lost.ReasonPropagatedMajorityOfMajority=='plagiarism']
    dfnrs_l = df_nonretracted_lost[df_nonretracted_lost.ReasonPropagatedMajorityOfMajority=='mistake']
    
    return [dfrj_r,dfrm_r,dfrs_r,dfrj_l,dfrm_l,dfrs_l,dfnrj_r,dfnrm_r,dfnrs_r,dfnrj_l,dfnrm_l,dfnrs_l]
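
The ID-alignment step above (keep only the scientists that appear in all four groups) can be sketched on toy data as follows. The frame names, IDs, and the `x` column are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Toy stand-ins for the four groups; the IDs and values are invented.
groups = {
    "retracted_retained": pd.DataFrame({"MAGAID": [1, 2, 3], "x": [0.1, 0.2, 0.3]}),
    "retracted_lost": pd.DataFrame({"MAGAID": [2, 3, 4], "x": [0.4, 0.5, 0.6]}),
    "matched_retained": pd.DataFrame({"MAGAID": [1, 2, 3, 5], "x": [0.7, 0.8, 0.9, 1.0]}),
    "matched_lost": pd.DataFrame({"MAGAID": [2, 3], "x": [1.1, 1.2]}),
}

# IDs present in every group.
common_ids = set.intersection(*(set(df["MAGAID"]) for df in groups.values()))

# Keep only rows whose MAGAID is shared by all four groups.
filtered = {name: df[df["MAGAID"].isin(common_ids)] for name, df in groups.items()}

for name, df in filtered.items():
    print(name, df["MAGAID"].nunique())
```

After filtering, every group reports the same number of unique IDs, which is the property the printed counts in the next cell check for the real data.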
    

In [61]:
lst_stratified_dfs = create_stratified_dfs_a3(df_A3_post_pre)

for dfj in lst_stratified_dfs:
    print(dfj.MAGAID.nunique())
    
dfrj_r,dfrm_r,dfrs_r,dfrj_l,dfrm_l,dfrs_l,dfnrj_r,dfnrm_r,dfnrs_r,dfnrj_l,dfnrm_l,dfnrs_l = lst_stratified_dfs

251
275
332
251
275
332
251
275
332
251
275
332


In [62]:
# Let us extract the mean dataframes and merge them for the different retraction reasons

def get_mean_df_a3(dfr, dfnr):
    mean_dfr = dfr.groupby('MAGAID')[['CollabAcademicAgeAtRetraction',
                      'CollabMAGCumPapersAtRetraction',
                      'CollabMAGCumCitationsAtRetraction',
                      'CollabMAGCumCollaboratorsAtRetraction']].mean()
    
    mean_dfnr = dfnr.groupby(['MAGAID','MatchMAGAID'])[['CollabAcademicAgeAtRetraction',
                      'CollabMAGCumPapersAtRetraction',
                      'CollabMAGCumCitationsAtRetraction',
                      'CollabMAGCumCollaboratorsAtRetraction']].mean()\
                    .groupby('MAGAID').mean()\
                    .rename(columns={'CollabMAGCumPapersAtRetraction':'MatchCollabMAGCumPapersAtRetraction',
                                    'CollabAcademicAgeAtRetraction':'MatchCollabAcademicAgeAtRetraction',
                                    'CollabMAGCumCitationsAtRetraction': 'MatchCollabMAGCumCitationsAtRetraction',
                                    'CollabMAGCumCollaboratorsAtRetraction': 'MatchCollabMAGCumCollaboratorsAtRetraction'})
    
    return mean_dfr, mean_dfnr

def mean_confidence_interval(data, confidence=0.95):
    # Returns the sample mean and the lower/upper bounds of its t-based confidence interval
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), stats.sem(a)
    h = se * stats.t.ppf((1 + confidence) / 2., n-1)
    return m, m-h, m+h

def get_stats(dfr, dfnr, column):
    """
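    Compute the mean, median, and standard deviation of the given column for the
    retracted and matched groups, the p-value of Welch's t-test between them, and
    the 95% CI of the mean pairwise difference (retracted minus matched).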
    """
    
    mean_r = dfr[column].mean()
    median_r = dfr[column].median()
    std_r = dfr[column].std()
    
    mean_nr = dfnr['Match'+column].mean()
    median_nr = dfnr['Match'+column].median()
    std_nr = dfnr['Match'+column].std()
    
    _, pval = stats.ttest_ind(dfr[column],dfnr['Match'+column], equal_var=False)
    
    lst_delta = (dfr[column]- dfnr['Match'+column]).tolist()
    delta_mean, conf_lower, conf_upper = mean_confidence_interval(lst_delta, confidence=0.95)
    
    return {column+'_retracted_mean':round(mean_r,2), 
            column+'_retracted_median':round(median_r,2), 
            column+'_retracted_std':round(std_r,2), 
            column+'_nonretracted_mean':round(mean_nr,2), 
            column+'_nonretracted_median':round(median_nr,2), 
            column+'_nonretracted_std':round(std_nr,2), 
            column+'_delta_mean':round(delta_mean,2), 
            column+'_pval_welch':round(pval,3), 
            column+'_CI_95lower':round(conf_lower,2), 
            column+'_CI_95upper':round(conf_upper,2)}
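
As a minimal sketch of what `get_stats` reports, the snippet below runs the same two procedures on synthetic data: Welch's t-test on the two groups, and a t-based 95% confidence interval on the mean of the pairwise differences. The arrays are randomly generated, and the pairing of element `i` across the two arrays is an assumption made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic per-scientist values; not taken from the notebook's data.
retracted = rng.normal(loc=2.0, scale=6.0, size=200)
matched = rng.normal(loc=2.5, scale=6.0, size=200)

# Welch's t-test: independent samples, unequal variances allowed.
_, pval = stats.ttest_ind(retracted, matched, equal_var=False)

# 95% CI on the mean of the pairwise differences (element i of each array
# is assumed to belong to the same matched pair).
delta = retracted - matched
n = len(delta)
mean_delta = delta.mean()
half_width = stats.sem(delta) * stats.t.ppf(0.975, n - 1)
ci_lower, ci_upper = mean_delta - half_width, mean_delta + half_width

print(round(mean_delta, 2), round(pval, 3), round(ci_lower, 2), round(ci_upper, 2))
```

Note that the p-value comes from an unpaired test while the interval is built on paired differences, so the two need not agree on significance in every case.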

In [63]:
# Now let us do the comparison

# Let us first get the mean dataframes

mean_dfrj_r, mean_dfnrj_r = get_mean_df_a3(dfrj_r, dfnrj_r)
mean_dfrj_l, mean_dfnrj_l = get_mean_df_a3(dfrj_l, dfnrj_l)

mean_dfrm_r, mean_dfnrm_r = get_mean_df_a3(dfrm_r, dfnrm_r)
mean_dfrm_l, mean_dfnrm_l = get_mean_df_a3(dfrm_l, dfnrm_l)

mean_dfrs_r, mean_dfnrs_r = get_mean_df_a3(dfrs_r, dfnrs_r)
mean_dfrs_l, mean_dfnrs_l = get_mean_df_a3(dfrs_l, dfnrs_l)


# Now let us compute differences

def compute_diff_df(df_ri, df_li, scientistType='retracted'):
    
    dfrli = df_ri.merge(df_li, right_index=True, left_index=True)
    
    if scientistType == 'matched':
        
        dfrli['MatchDiffAcademicAgeAtRetraction'] = dfrli['MatchCollabAcademicAgeAtRetraction_x'] - \
                                                dfrli['MatchCollabAcademicAgeAtRetraction_y']
        
        dfrli['MatchDiffMAGCumPapersAtRetraction'] = dfrli['MatchCollabMAGCumPapersAtRetraction_x'] - \
                                                dfrli['MatchCollabMAGCumPapersAtRetraction_y']
        
        dfrli['MatchDiffMAGCumCitationsAtRetraction'] = dfrli['MatchCollabMAGCumCitationsAtRetraction_x'] - \
                                                dfrli['MatchCollabMAGCumCitationsAtRetraction_y']
        
        dfrli['MatchDiffMAGCumCollaboratorsAtRetraction'] = dfrli['MatchCollabMAGCumCollaboratorsAtRetraction_x'] - \
                                                dfrli['MatchCollabMAGCumCollaboratorsAtRetraction_y']
        
        return dfrli
    
        
    dfrli['DiffAcademicAgeAtRetraction'] = dfrli['CollabAcademicAgeAtRetraction_x'] - \
                                            dfrli['CollabAcademicAgeAtRetraction_y']

    dfrli['DiffMAGCumPapersAtRetraction'] = dfrli['CollabMAGCumPapersAtRetraction_x'] - \
                                            dfrli['CollabMAGCumPapersAtRetraction_y']

    dfrli['DiffMAGCumCitationsAtRetraction'] = dfrli['CollabMAGCumCitationsAtRetraction_x'] - \
                                            dfrli['CollabMAGCumCitationsAtRetraction_y']

    dfrli['DiffMAGCumCollaboratorsAtRetraction'] = dfrli['CollabMAGCumCollaboratorsAtRetraction_x'] - \
                                            dfrli['CollabMAGCumCollaboratorsAtRetraction_y']

    return dfrli
    


In [64]:
dfrj_rMinusl = compute_diff_df(mean_dfrj_r, mean_dfrj_l)
dfnrj_rMinusl = compute_diff_df(mean_dfnrj_r, mean_dfnrj_l, scientistType='matched')


dfrm_rMinusl = compute_diff_df(mean_dfrm_r, mean_dfrm_l)
dfnrm_rMinusl = compute_diff_df(mean_dfnrm_r, mean_dfnrm_l, scientistType='matched')

dfrs_rMinusl = compute_diff_df(mean_dfrs_r, mean_dfrs_l)
dfnrs_rMinusl = compute_diff_df(mean_dfnrs_r, mean_dfnrs_l, scientistType='matched')
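
The difference-in-differences computation can be illustrated on a few invented numbers. This sketch follows the order used by `compute_diff_df` and `get_stats`: first retained minus lost within each scientist type, then retracted minus matched. The IDs and values are hypothetical.

```python
import pandas as pd

# Toy per-scientist means indexed by MAGAID; all numbers are invented.
idx = pd.Index([101, 102, 103], name="MAGAID")
retracted_retained = pd.Series([12.0, 15.0, 9.0], index=idx)
retracted_lost = pd.Series([10.0, 11.0, 8.0], index=idx)
matched_retained = pd.Series([14.0, 13.0, 12.0], index=idx)
matched_lost = pd.Series([11.0, 12.0, 8.0], index=idx)

# First difference: retained minus lost, within each scientist type.
diff_retracted = retracted_retained - retracted_lost  # 2, 4, 1
diff_matched = matched_retained - matched_lost        # 3, 1, 4

# Second difference: retracted minus matched, aligned on MAGAID.
did = diff_retracted - diff_matched                   # -1, 3, -3

print(did.mean())
```

Because subtraction is linear, taking retracted minus matched first and retained minus lost second yields the same mean DiD.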

In [65]:
exp_fields = ['DiffAcademicAgeAtRetraction',
              'DiffMAGCumPapersAtRetraction',
              'DiffMAGCumCitationsAtRetraction',
              'DiffMAGCumCollaboratorsAtRetraction']

# Now we compute the outcome variables for each of the four experience variables.

lst_dicts_a3 = []

for exp_field in exp_fields:
    dicts_a3 = {}
    
    dict_stats_j = get_stats(dfrj_rMinusl, dfnrj_rMinusl, exp_field)
    dict_stats_m = get_stats(dfrm_rMinusl, dfnrm_rMinusl, exp_field)
    dict_stats_s = get_stats(dfrs_rMinusl, dfnrs_rMinusl, exp_field)
    
    dicts_a3['Misconduct'] = dict_stats_j
    dicts_a3['Plagiarism'] = dict_stats_m
    dicts_a3['Mistake'] = dict_stats_s
    
    lst_dicts_a3.append(dicts_a3)

In [66]:
pd.DataFrame(lst_dicts_a3[0])

Unnamed: 0,Misconduct,Plagiarism,Mistake
DiffAcademicAgeAtRetraction_retracted_mean,1.71,1.9,2.27
DiffAcademicAgeAtRetraction_retracted_median,0.92,1.0,1.71
DiffAcademicAgeAtRetraction_retracted_std,6.52,5.67,6.73
DiffAcademicAgeAtRetraction_nonretracted_mean,2.05,2.86,2.35
DiffAcademicAgeAtRetraction_nonretracted_median,1.51,1.88,2.24
DiffAcademicAgeAtRetraction_nonretracted_std,6.24,6.99,6.4
DiffAcademicAgeAtRetraction_delta_mean,-0.34,-0.95,-0.08
DiffAcademicAgeAtRetraction_pval_welch,0.553,0.08,0.883
DiffAcademicAgeAtRetraction_CI_95lower,-1.36,-1.98,-0.98
DiffAcademicAgeAtRetraction_CI_95upper,0.68,0.07,0.83


In [67]:
pd.DataFrame(lst_dicts_a3[1])

Unnamed: 0,Misconduct,Plagiarism,Mistake
DiffMAGCumPapersAtRetraction_retracted_mean,25.67,19.56,19.39
DiffMAGCumPapersAtRetraction_retracted_median,9.71,9.5,13.97
DiffMAGCumPapersAtRetraction_retracted_std,88.15,51.93,57.56
DiffMAGCumPapersAtRetraction_nonretracted_mean,28.54,23.22,27.51
DiffMAGCumPapersAtRetraction_nonretracted_median,16.62,14.28,15.31
DiffMAGCumPapersAtRetraction_nonretracted_std,85.2,49.82,63.78
DiffMAGCumPapersAtRetraction_delta_mean,-2.87,-3.65,-8.12
DiffMAGCumPapersAtRetraction_pval_welch,0.711,0.4,0.085
DiffMAGCumPapersAtRetraction_CI_95lower,-17.83,-12.45,-16.83
DiffMAGCumPapersAtRetraction_CI_95upper,12.09,5.14,0.58


In [68]:
pd.DataFrame(lst_dicts_a3[2])

Unnamed: 0,Misconduct,Plagiarism,Mistake
DiffMAGCumCitationsAtRetraction_retracted_mean,251.03,158.23,184.52
DiffMAGCumCitationsAtRetraction_retracted_median,32.5,3.0,56.13
DiffMAGCumCitationsAtRetraction_retracted_std,1552.03,1283.6,1796.2
DiffMAGCumCitationsAtRetraction_nonretracted_mean,549.73,168.86,481.55
DiffMAGCumCitationsAtRetraction_nonretracted_median,117.37,27.7,117.89
DiffMAGCumCitationsAtRetraction_nonretracted_std,1992.59,1039.26,2164.78
DiffMAGCumCitationsAtRetraction_delta_mean,-298.7,-10.63,-297.03
DiffMAGCumCitationsAtRetraction_pval_welch,0.062,0.915,0.055
DiffMAGCumCitationsAtRetraction_CI_95lower,-609.16,-212.13,-609.02
DiffMAGCumCitationsAtRetraction_CI_95upper,11.76,190.87,14.96


In [69]:
pd.DataFrame(lst_dicts_a3[3])

Unnamed: 0,Misconduct,Plagiarism,Mistake
DiffMAGCumCollaboratorsAtRetraction_retracted_mean,30.72,29.74,41.91
DiffMAGCumCollaboratorsAtRetraction_retracted_median,14.72,8.94,16.8
DiffMAGCumCollaboratorsAtRetraction_retracted_std,195.62,139.08,217.6
DiffMAGCumCollaboratorsAtRetraction_nonretracted_mean,38.1,40.96,44.75
DiffMAGCumCollaboratorsAtRetraction_nonretracted_median,35.29,27.0,23.53
DiffMAGCumCollaboratorsAtRetraction_nonretracted_std,416.61,118.81,155.45
DiffMAGCumCollaboratorsAtRetraction_delta_mean,-7.38,-11.22,-2.83
DiffMAGCumCollaboratorsAtRetraction_pval_welch,0.8,0.31,0.847
DiffMAGCumCollaboratorsAtRetraction_CI_95lower,-65.53,-33.62,-30.9
DiffMAGCumCollaboratorsAtRetraction_CI_95upper,50.77,11.19,25.23


In [70]:
def create_latex_for_filling(dicto, col):
    
    def create_string(metric):
        string = ""
        if metric == 'pval_welch':
            string = "& " + \
                str(dicto.get('Misconduct').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Misconduct').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Plagiarism').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Plagiarism').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Mistake').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Mistake').get(col+"_"+metric)) + \
                "\\\\ \n"
        else:
            string = "& " + \
                    str(dicto.get('Misconduct').get(col+"_retracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Misconduct').get(col+"_nonretracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Plagiarism').get(col+"_retracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Plagiarism').get(col+"_nonretracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Mistake').get(col+"_retracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Mistake').get(col+"_nonretracted_"+metric)) + \
                    "\\\\ \n"
        
        
        
        return string
    
    print(create_string("mean"))
    print(create_string("median"))
    print(create_string("std"))
    print(create_string("pval_welch"))
    

for i in range(len(lst_dicts_a3)):
    dicto = lst_dicts_a3[i]
    col = exp_fields[i]
    print(col)
    create_latex_for_filling(dicto, col)

DiffAcademicAgeAtRetraction
& 1.71 & 2.05 & 1.9 & 2.86 & 2.27 & 2.35\\ 

& 0.92 & 1.51 & 1.0 & 1.88 & 1.71 & 2.24\\ 

& 6.52 & 6.24 & 5.67 & 6.99 & 6.73 & 6.4\\ 

& 0.553 & 0.553 & 0.08 & 0.08 & 0.883 & 0.883\\ 

DiffMAGCumPapersAtRetraction
& 25.67 & 28.54 & 19.56 & 23.22 & 19.39 & 27.51\\ 

& 9.71 & 16.62 & 9.5 & 14.28 & 13.97 & 15.31\\ 

& 88.15 & 85.2 & 51.93 & 49.82 & 57.56 & 63.78\\ 

& 0.711 & 0.711 & 0.4 & 0.4 & 0.085 & 0.085\\ 

DiffMAGCumCitationsAtRetraction
& 251.03 & 549.73 & 158.23 & 168.86 & 184.52 & 481.55\\ 

& 32.5 & 117.37 & 3.0 & 27.7 & 56.13 & 117.89\\ 

& 1552.03 & 1992.59 & 1283.6 & 1039.26 & 1796.2 & 2164.78\\ 

& 0.062 & 0.062 & 0.915 & 0.915 & 0.055 & 0.055\\ 

DiffMAGCumCollaboratorsAtRetraction
& 30.72 & 38.1 & 29.74 & 40.96 & 41.91 & 44.75\\ 

& 14.72 & 35.29 & 8.94 & 27.0 & 16.8 & 23.53\\ 

& 195.62 & 416.61 & 139.08 & 118.81 & 217.6 & 155.45\\ 

& 0.8 & 0.8 & 0.31 & 0.31 & 0.847 & 0.847\\ 
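
The row-building logic in `create_latex_for_filling` can be written more compactly with `str.join`. The following is only a sketch: `latex_row` is a hypothetical helper that is not part of the notebook, and its spacing around the row terminator differs slightly from the printed output above.

```python
def latex_row(values):
    # Build one table row of the form "& v1 & v2 ... \\" (hypothetical helper).
    return "& " + " & ".join(str(v) for v in values) + r" \\"

# Values copied from the DiffAcademicAgeAtRetraction mean row printed above.
row = latex_row([1.71, 2.05, 1.9, 2.86, 2.27, 2.35])
print(row)
```

Using a raw string for the terminator avoids having to escape each backslash by hand.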



# Processing dictionaries for plots

In [71]:
expfield_categories = ['Academic Age','Number of Papers',
                       'Number of Citations', 'Number of Collaborators']

master_dict = {}

master_dict['Retention'] = {}

for i in range(len(expfield_categories)):
    master_dict['Retention'][expfield_categories[i]] = lst_dicts_retention[i]

master_dict['Gain'] = {}

for i in range(len(expfield_categories)):
    master_dict['Gain'][expfield_categories[i]] = lst_dicts_gain[i]
    
master_dict['DiD'] = {}

for i in range(len(expfield_categories)):
    master_dict['DiD'][expfield_categories[i]] = lst_dicts_a3[i]

In [72]:
master_dict.keys()

dict_keys(['Retention', 'Gain', 'DiD'])

In [73]:
master_dict

{'Retention': {'Academic Age': {'Misconduct': {'CollabAcademicAgeAtCollaboration_retracted_mean': 13.56,
    'CollabAcademicAgeAtCollaboration_retracted_median': 13.65,
    'CollabAcademicAgeAtCollaboration_retracted_std': 7.43,
    'CollabAcademicAgeAtCollaboration_nonretracted_mean': 14.78,
    'CollabAcademicAgeAtCollaboration_nonretracted_median': 14.42,
    'CollabAcademicAgeAtCollaboration_nonretracted_std': 6.41,
    'CollabAcademicAgeAtCollaboration_delta_mean': -1.23,
    'CollabAcademicAgeAtCollaboration_pval_welch': 0.046,
    'CollabAcademicAgeAtCollaboration_CI_95lower': -2.31,
    'CollabAcademicAgeAtCollaboration_CI_95upper': -0.15},
   'Plagiarism': {'CollabAcademicAgeAtCollaboration_retracted_mean': 12.71,
    'CollabAcademicAgeAtCollaboration_retracted_median': 12.14,
    'CollabAcademicAgeAtCollaboration_retracted_std': 7.03,
    'CollabAcademicAgeAtCollaboration_nonretracted_mean': 15.24,
    'CollabAcademicAgeAtCollaboration_nonretracted_median': 14.1,
    'CollabA

In [74]:
def save_dict(dicto, fname):
    import pickle 

    with open(fname, 'wb') as f:
        pickle.dump(dicto, f)
        
def read_dict(fname):
    import pickle
    
    with open(fname, 'rb') as f:
        loaded_dict = pickle.load(f)
        return loaded_dict
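
A quick way to check that a nested dictionary survives the pickle round trip is sketched below. It writes to a temporary directory so that no real output file is touched; the file name and the dictionary contents are placeholders, shaped loosely like `master_dict`.

```python
import os
import pickle
import tempfile

# A small nested dictionary shaped like master_dict; values are placeholders.
example = {"DiD": {"Academic Age": {"Misconduct": {"delta_mean": -0.34}}}}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(example, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == example)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.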

In [75]:
save_dict(master_dict, "collaborator_chars_byReason.pkl")

In [77]:
master_dict

{'Retention': {'Academic Age': {'Misconduct': {'CollabAcademicAgeAtCollaboration_retracted_mean': 13.56,
    'CollabAcademicAgeAtCollaboration_retracted_median': 13.65,
    'CollabAcademicAgeAtCollaboration_retracted_std': 7.43,
    'CollabAcademicAgeAtCollaboration_nonretracted_mean': 14.78,
    'CollabAcademicAgeAtCollaboration_nonretracted_median': 14.42,
    'CollabAcademicAgeAtCollaboration_nonretracted_std': 6.41,
    'CollabAcademicAgeAtCollaboration_delta_mean': -1.23,
    'CollabAcademicAgeAtCollaboration_pval_welch': 0.046,
    'CollabAcademicAgeAtCollaboration_CI_95lower': -2.31,
    'CollabAcademicAgeAtCollaboration_CI_95upper': -0.15},
   'Plagiarism': {'CollabAcademicAgeAtCollaboration_retracted_mean': 12.71,
    'CollabAcademicAgeAtCollaboration_retracted_median': 12.14,
    'CollabAcademicAgeAtCollaboration_retracted_std': 7.03,
    'CollabAcademicAgeAtCollaboration_nonretracted_mean': 15.24,
    'CollabAcademicAgeAtCollaboration_nonretracted_median': 14.1,
    'CollabA