# Characterizing Collaborators

In this notebook, we characterize the collaborators of retracted and matched scientists, as follows.

Scientists are split into three academic age groups: junior (0-3 years), mid (4-9 years), and senior (10+ years).

For each academic age group of retracted and matched scientists, determined **at the time of retraction**, we shall conduct three analyses and create three tables:

#### Retained for retracted vs. matched
1. Table 1 compares the **retained** collaborators of retracted and matched scientists in terms of their (a) mean academic age, (b) average number of papers, (c) average number of citations, and (d) average number of collaborators, all **at the time of collaboration**. The table will also contain the median, the standard deviation, and the p-value of a t-test.

#### Gained for retracted vs. matched
2. Table 2 compares the **gained/new** collaborators of retracted and matched scientists in terms of their (a) mean academic age, (b) average number of papers, (c) average number of citations, and (d) average number of collaborators, all **at the time of collaboration**. The table will also contain the median, the standard deviation, and the p-value of a t-test.

#### Retained vs. lost for retracted vs. matched
3. Table 3 compares the **retained** collaborators of retracted and matched scientists to those **lost** in terms of their (a) mean academic age, (b) average number of papers, (c) average number of citations, and (d) average number of collaborators, all **at the time of retraction**. The table will be produced by a difference-in-differences approach. We shall first compute the average of each field (papers, citations, etc.) for retained and lost collaborators of retracted and matched scientists. Then we shall compute the difference between retracted and matched for retained collaborators, and likewise for lost collaborators. Finally, we shall take the difference in differences (DiD), i.e. **RETAINED-LOST**. The table will also contain the median, the standard deviation, and the p-value of a t-test.
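To make the three DiD steps concrete, here is a minimal sketch on made-up numbers. The dataframe `toy` and the columns `CollabStatus` and `CollabPapers` are hypothetical and only illustrate the arithmetic; the sketch also pools all collaborators directly instead of first averaging per scientist, so it is a simplification and not the notebook's implementation.

```python
import pandas as pd

# Hypothetical toy data: one row per collaborator (values are made up).
toy = pd.DataFrame({
    "ScientistType": ["retracted", "retracted", "retracted",
                      "matched", "matched", "matched"],
    "CollabStatus": ["retained", "lost", "lost",
                     "retained", "retained", "lost"],
    "CollabPapers": [30.0, 10.0, 14.0, 24.0, 26.0, 11.0],
})

# Step 1: average of the field for each (status, scientist type) cell.
cell_means = (toy.groupby(["CollabStatus", "ScientistType"])["CollabPapers"]
                 .mean()
                 .unstack())

# Step 2: retracted minus matched, separately for retained and for lost.
diff = cell_means["retracted"] - cell_means["matched"]

# Step 3: difference in differences, i.e. RETAINED - LOST.
did = diff["retained"] - diff["lost"]

print(cell_means)
print("DiD:", did)
```

A positive DiD in this toy setting would mean that the retracted-vs-matched gap is larger among retained collaborators than among lost ones.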



In [1]:
import pandas as pd
import sys
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats



In [2]:
INDIR = "/Users/sm9654/desktop/NYUAD/nyuad-research/retraction_openalex/retraction_effects_on_academic_careers/data/processed/"
INDIR_MATCHING = INDIR+"/author_matching/"
INDIR_COLLAB = INDIR+"/collaborator_quality_analysis/"

df = pd.read_csv(INDIR_COLLAB+"/1Dcollaborators_for_matched_sample_30.csv")

print(df.shape)

df.head()

(773033, 20)


Unnamed: 0,MAGAID,MAGCollabAID,RetractionYear,MAGCollaborationYear,ScientistType,CollabGenderizeGender,CollabGenderizeConfidence,CollabMAGFirstPubYear,CollabMAGCumPapersYearAtRetraction,CollabMAGCumPapersAtRetraction,CollabMAGCumCitationsYearAtRetraction,CollabMAGCumCitationsAtRetraction,CollabMAGCumCollaboratorsYearAtRetraction,CollabMAGCumCollaboratorsAtRetraction,CollabMAGCumPapersYearAtCollaboration,CollabMAGCumPapersAtCollaboration,CollabMAGCumCitationsYearAtCollaboration,CollabMAGCumCitationsAtCollaboration,CollabMAGCumCollaboratorsYearAtCollaboration,CollabMAGCumCollaboratorsAtCollaboration
0,2105038000.0,2004120834,1994.0,1983.0,retracted,male,0.99,1967.0,1994.0,104.0,1994.0,3683.0,1994.0,83.0,1983.0,47.0,1983.0,1076.0,1983.0,46
1,2105038000.0,2124401064,1994.0,1983.0,retracted,male,0.74,1964.0,1994.0,78.0,1994.0,2668.0,1994.0,81.0,1983.0,47.0,1983.0,1305.0,1983.0,37
2,2105038000.0,2486043001,1994.0,1983.0,retracted,male,0.6,1971.0,1983.0,10.0,1994.0,532.0,1983.0,18.0,1983.0,10.0,1983.0,199.0,1983.0,18
3,2105038000.0,2124401064,1994.0,1992.0,retracted,male,0.74,1964.0,1994.0,78.0,1994.0,2668.0,1994.0,81.0,1992.0,74.0,1992.0,2449.0,1992.0,71
4,2105038000.0,2276877851,1994.0,1992.0,retracted,female,0.98,1984.0,1993.0,16.0,1994.0,136.0,1993.0,31.0,1992.0,14.0,1992.0,90.0,1992.0,27


In [3]:
print(df.shape)

(773033, 20)


In [4]:
df.MAGCollabAID.nunique()

411911

### Preprocessing

In [5]:
df_treatment = pd.read_csv(INDIR_MATCHING+"/RWMAG_rematched_treatment_augmented_rematching_30perc.csv",
                          usecols=['AcademicAgeBeforeRetraction'])

df_treatment

Unnamed: 0,AcademicAgeBeforeRetraction
0,2.0
1,5.0
2,5.0
3,22.0
4,10.0
...,...
2803,37.0
2804,6.0
2805,3.0
2806,3.0


In [6]:
# Let us first augment the academic age of MAGAIDs. We will also add other columns to be used later

# Reading files used for matching

df_treatment = pd.read_csv(INDIR_MATCHING+"/RWMAG_rematched_treatment_augmented_rematching_30perc.csv",
                    usecols=['MAGAID','RetractionYear','NumRetentionW5','NumNewCollaboratorsW5','AcademicAgeBeforeRetraction'])\
                    .drop_duplicates()\
                    .rename(columns={'AcademicAgeBeforeRetraction': 'AcademicAgeAtRetraction'})

df_control = pd.read_csv(INDIR_MATCHING+"/RWMAG_rematched_control_augmented_rematching_30perc.csv",
                    usecols=['MatchMAGAID','RetractionYear','NumRetentionW5','NumNewCollaboratorsW5','AcademicAgeBeforeRetraction'])\
                    .drop_duplicates()\
                    .rename(columns={'MatchMAGAID':'MAGAID',
                                    'AcademicAgeBeforeRetraction': 'AcademicAgeAtRetraction'})

df_treatment_control = pd.concat([df_treatment,df_control])

# Now let us categorize age into the 3 bins we discussed: 0-3, 4-9, and >=10

def categorize_age(age):
    if age <= 3:
        return 'early-career author'
    elif (age > 3) and (age < 10):
        return 'mid-career author'
    elif age >= 10:
        return 'senior author'

df_treatment_control['AuthorSeniorityAtRetraction'] = df_treatment_control['AcademicAgeAtRetraction'].\
                                                        apply(lambda age: categorize_age(age))

df_treatment_control

Unnamed: 0,MAGAID,RetractionYear,AcademicAgeAtRetraction,NumRetentionW5,NumNewCollaboratorsW5,AuthorSeniorityAtRetraction
0,2.184860e+06,2008.0,2.0,6,5,early-career author
1,8.197726e+06,2012.0,5.0,4,10,mid-career author
3,9.474215e+06,2015.0,22.0,55,481,senior author
4,1.373700e+07,2014.0,10.0,8,8,senior author
5,1.551904e+07,2013.0,15.0,0,4,senior author
...,...,...,...,...,...,...
5413,2.100696e+09,2014.0,37.0,15,89,senior author
5414,1.933448e+09,2014.0,6.0,10,36,mid-career author
5415,2.304718e+09,2013.0,3.0,0,9,early-career author
5416,2.077873e+09,2008.0,3.0,2,3,early-career author
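As an aside, the same three bins can be produced without a row-wise `apply`. The sketch below uses `pd.cut` on a made-up series of ages and assumes that ages are whole numbers; a fractional age between 9 and 10 would land in a different bin than in `categorize_age`, and missing ages stay `NaN` instead of becoming `None`.

```python
import numpy as np
import pandas as pd

# Made-up ages covering both edges of every bin.
ages = pd.Series([0, 3, 4, 9, 10, 37], name="AcademicAgeAtRetraction")

# Right-closed intervals: (-inf, 3], (3, 9], (9, inf).
seniority = pd.cut(
    ages,
    bins=[-np.inf, 3, 9, np.inf],
    labels=["early-career author", "mid-career author", "senior author"],
)

print(seniority.tolist())
```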


In [7]:
# Merging that with df

df2 = df.merge(df_treatment_control.drop(columns=['NumRetentionW5','NumNewCollaboratorsW5']), 
                                         on=['MAGAID','RetractionYear'])
df2

Unnamed: 0,MAGAID,MAGCollabAID,RetractionYear,MAGCollaborationYear,ScientistType,CollabGenderizeGender,CollabGenderizeConfidence,CollabMAGFirstPubYear,CollabMAGCumPapersYearAtRetraction,CollabMAGCumPapersAtRetraction,...,CollabMAGCumCollaboratorsYearAtRetraction,CollabMAGCumCollaboratorsAtRetraction,CollabMAGCumPapersYearAtCollaboration,CollabMAGCumPapersAtCollaboration,CollabMAGCumCitationsYearAtCollaboration,CollabMAGCumCitationsAtCollaboration,CollabMAGCumCollaboratorsYearAtCollaboration,CollabMAGCumCollaboratorsAtCollaboration,AcademicAgeAtRetraction,AuthorSeniorityAtRetraction
0,2.105038e+09,2004120834,1994.0,1983.0,retracted,male,0.99,1967.0,1994.0,104.0,...,1994.0,83.0,1983.0,47.0,1983.0,1076.0,1983.0,46,31.0,senior author
1,2.105038e+09,2124401064,1994.0,1983.0,retracted,male,0.74,1964.0,1994.0,78.0,...,1994.0,81.0,1983.0,47.0,1983.0,1305.0,1983.0,37,31.0,senior author
2,2.105038e+09,2486043001,1994.0,1983.0,retracted,male,0.60,1971.0,1983.0,10.0,...,1983.0,18.0,1983.0,10.0,1983.0,199.0,1983.0,18,31.0,senior author
3,2.105038e+09,2124401064,1994.0,1992.0,retracted,male,0.74,1964.0,1994.0,78.0,...,1994.0,81.0,1992.0,74.0,1992.0,2449.0,1992.0,71,31.0,senior author
4,2.105038e+09,2276877851,1994.0,1992.0,retracted,female,0.98,1984.0,1993.0,16.0,...,1993.0,31.0,1992.0,14.0,1992.0,90.0,1992.0,27,31.0,senior author
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
773028,2.294600e+09,2972696792,2012.0,2020.0,matched,male,0.99,2019.0,,0.0,...,,0.0,2020.0,3.0,2020.0,1.0,2020.0,17,8.0,mid-career author
773029,2.294600e+09,3111842016,2012.0,2020.0,matched,female,0.98,2020.0,,0.0,...,,0.0,2020.0,1.0,,0.0,2020.0,7,8.0,mid-career author
773030,2.294600e+09,3112134165,2012.0,2020.0,matched,female,0.99,2020.0,,0.0,...,,0.0,2020.0,1.0,,0.0,2020.0,7,8.0,mid-career author
773031,2.294600e+09,3112412017,2012.0,2020.0,matched,female,0.97,2020.0,,0.0,...,,0.0,2020.0,1.0,,0.0,2020.0,7,8.0,mid-career author


In [8]:
# Let us first compute academic age at retraction and at collaboration for collaborators
df2['CollabAcademicAgeAtRetraction'] = df2['RetractionYear']-df2['CollabMAGFirstPubYear']

df2['CollabAcademicAgeAtCollaboration'] = df2['MAGCollaborationYear']-df2['CollabMAGFirstPubYear']

# Negative values are possible for academic age at retraction (first publication after the retraction year)
# but not for academic age at collaboration
df2.CollabAcademicAgeAtRetraction.describe()

count    773033.000000
mean          9.718470
std          13.846648
min         -30.000000
25%           0.000000
50%           7.000000
75%          17.000000
max         215.000000
Name: CollabAcademicAgeAtRetraction, dtype: float64

In [9]:
df2.CollabAcademicAgeAtCollaboration.describe()

count    773033.000000
mean         10.852240
std          12.640748
min           0.000000
25%           1.000000
50%           7.000000
75%          17.000000
max         220.000000
Name: CollabAcademicAgeAtCollaboration, dtype: float64

In [10]:
# Let us first identify if the collaboration was pre- or post-retraction

def get_prepost_flag(row):
    if(row['MAGCollaborationYear'] <= row['RetractionYear']):
        return 'pre'
    else:
        if((row['MAGCollaborationYear']-row['RetractionYear'])<=5):
            return 'post5'
        else:
            return 'post'

df2['PrePostFlag5'] = df2.apply(lambda row: get_prepost_flag(row), axis=1)
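The row-wise `apply` above can be slow on several hundred thousand rows. A vectorized sketch of the same three-way rule is shown below on a made-up dataframe `toy`; it is meant as an illustration of the logic, not as a replacement that has been benchmarked on the real data.

```python
import numpy as np
import pandas as pd

# Made-up rows: before, same year, 5 years after, and 6 years after retraction.
toy = pd.DataFrame({
    "RetractionYear": [2010.0, 2010.0, 2010.0, 2010.0],
    "MAGCollaborationYear": [2008.0, 2010.0, 2015.0, 2016.0],
})

# Years elapsed between the retraction and the collaboration.
gap = toy["MAGCollaborationYear"] - toy["RetractionYear"]

# gap <= 0 -> 'pre'; 0 < gap <= 5 -> 'post5'; otherwise -> 'post'.
toy["PrePostFlag5"] = np.where(gap <= 0, "pre",
                               np.where(gap <= 5, "post5", "post"))

print(toy)
```

As in `get_prepost_flag`, a row with a missing year fails both comparisons and ends up flagged as `'post'`.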

In [11]:
# Let us remove the collaborators that are "post"

df3 = df2[~df2.PrePostFlag5.eq('post')]


In [12]:
df3.head()

Unnamed: 0,MAGAID,MAGCollabAID,RetractionYear,MAGCollaborationYear,ScientistType,CollabGenderizeGender,CollabGenderizeConfidence,CollabMAGFirstPubYear,CollabMAGCumPapersYearAtRetraction,CollabMAGCumPapersAtRetraction,...,CollabMAGCumPapersAtCollaboration,CollabMAGCumCitationsYearAtCollaboration,CollabMAGCumCitationsAtCollaboration,CollabMAGCumCollaboratorsYearAtCollaboration,CollabMAGCumCollaboratorsAtCollaboration,AcademicAgeAtRetraction,AuthorSeniorityAtRetraction,CollabAcademicAgeAtRetraction,CollabAcademicAgeAtCollaboration,PrePostFlag5
0,2105038000.0,2004120834,1994.0,1983.0,retracted,male,0.99,1967.0,1994.0,104.0,...,47.0,1983.0,1076.0,1983.0,46,31.0,senior author,27.0,16.0,pre
1,2105038000.0,2124401064,1994.0,1983.0,retracted,male,0.74,1964.0,1994.0,78.0,...,47.0,1983.0,1305.0,1983.0,37,31.0,senior author,30.0,19.0,pre
2,2105038000.0,2486043001,1994.0,1983.0,retracted,male,0.6,1971.0,1983.0,10.0,...,10.0,1983.0,199.0,1983.0,18,31.0,senior author,23.0,12.0,pre
3,2105038000.0,2124401064,1994.0,1992.0,retracted,male,0.74,1964.0,1994.0,78.0,...,74.0,1992.0,2449.0,1992.0,71,31.0,senior author,30.0,28.0,pre
4,2105038000.0,2276877851,1994.0,1992.0,retracted,female,0.98,1984.0,1993.0,16.0,...,14.0,1992.0,90.0,1992.0,27,31.0,senior author,10.0,8.0,pre


In [13]:
df3.columns

Index(['MAGAID', 'MAGCollabAID', 'RetractionYear', 'MAGCollaborationYear',
       'ScientistType', 'CollabGenderizeGender', 'CollabGenderizeConfidence',
       'CollabMAGFirstPubYear', 'CollabMAGCumPapersYearAtRetraction',
       'CollabMAGCumPapersAtRetraction',
       'CollabMAGCumCitationsYearAtRetraction',
       'CollabMAGCumCitationsAtRetraction',
       'CollabMAGCumCollaboratorsYearAtRetraction',
       'CollabMAGCumCollaboratorsAtRetraction',
       'CollabMAGCumPapersYearAtCollaboration',
       'CollabMAGCumPapersAtCollaboration',
       'CollabMAGCumCitationsYearAtCollaboration',
       'CollabMAGCumCitationsAtCollaboration',
       'CollabMAGCumCollaboratorsYearAtCollaboration',
       'CollabMAGCumCollaboratorsAtCollaboration', 'AcademicAgeAtRetraction',
       'AuthorSeniorityAtRetraction', 'CollabAcademicAgeAtRetraction',
       'CollabAcademicAgeAtCollaboration', 'PrePostFlag5'],
      dtype='object')

In [14]:
# For each MAGAID, let us create a set of collaborators pre- and post- retraction

df4 = df3.groupby(['MAGAID','RetractionYear','PrePostFlag5'])\
                        ['MAGCollabAID'].apply(set).unstack().reset_index()


# Replacing missing pre/post5 entries (NaN after unstack) with empty sets so we can do set operations
df4['pre'] = df4['pre'].apply(lambda d: d if isinstance(d, set) else set())
df4['post5'] = df4['post5'].apply(lambda d: d if isinstance(d, set) else set())


# COLLABORATOR RETENTION

# Computing number of collaborators retained
df4['NumRetentionW5'] = df4.apply(lambda row: len(row.post5.intersection(row.pre)), 
                            axis=1)

# Creating the set of collaborators retained
df4['CollabAIDRetainedW5'] = df4.apply(lambda row: row.post5.intersection(row.pre), 
                                                    axis=1)


# Creating the set of collaborators lost
df4['CollabAIDLostW5'] = df4.apply(lambda row: row['pre'] - row['CollabAIDRetainedW5'], 
                                                    axis=1)


# COLLABORATOR GAIN

# Computing number of collaborators gained
df4['NumNewCollaboratorsW5'] = df4.apply(lambda row: len(row['post5']-row['pre']), 
                                                    axis=1)

# Creating set of collaborators gained
df4['CollabAIDGainedW5'] = df4.apply(lambda row: row['post5']-row['pre'], 
                                                    axis=1)


df4.head()

PrePostFlag5,MAGAID,RetractionYear,post5,pre,NumRetentionW5,CollabAIDRetainedW5,CollabAIDLostW5,NumNewCollaboratorsW5,CollabAIDGainedW5
0,2184860.0,2008.0,"{1749937409, 2149399999, 2617454923, 260192180...","{2118754826, 2601921804, 2665868049, 230519298...",6,"{2617454923, 2601921804, 2032600174, 230576087...","{1834047680, 2038079680, 258432864, 2182005956...",5,"{1749937409, 1517361100, 2798255860, 251012216..."
1,5440459.0,2010.0,"{2084557827, 2402421397, 2235672854, 213407784...","{2132186624, 2084557827, 3130870283, 222369435...",13,"{2680186915, 2084557827, 2144828135, 97975655,...","{2132186624, 3130870283, 2223694350, 215816449...",35,"{2134077845, 2402421397, 2235672854, 217199132..."
2,8197726.0,2012.0,"{1969204096, 2226225926, 2024673639, 199363863...","{2024673639, 279977545, 1970425162, 1689175372...",4,"{1970425162, 1689175372, 2122225613, 2024673639}","{279977545, 1778388943, 2506500849, 343947537,...",10,"{1969204096, 2226225926, 1993638631, 256742932..."
3,8227037.0,2003.0,"{2438245003, 2566236814, 2099366676, 186420623...","{2119475842, 2103663625, 3037713418, 276189799...",6,"{2698681252, 2464725413, 2135649512, 251608811...","{2119475842, 2103663625, 3037713418, 276189799...",20,"{2438245003, 1513362301, 2566236814, 209936667..."
4,9474215.0,2015.0,"{2741493764, 1134876678, 2589376522, 288503604...","{2102010368, 2147699201, 2402001923, 277873818...",55,"{2102010368, 2147699201, 2402001923, 211700557...","{2613431302, 1578585096, 2047308816, 216607540...",481,"{2741493764, 2589376522, 2885036046, 297136949..."
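The set logic behind the retained, lost, and gained columns can be checked by hand on a tiny example. The collaborator IDs below are made up.

```python
# Made-up collaborator IDs for a single scientist.
pre = {101, 102, 103, 104}    # collaborators up to the retraction year
post5 = {103, 104, 105}       # collaborators in the 5 years after

retained = post5 & pre        # seen both before and after
lost = pre - retained         # seen before but not after
gained = post5 - pre          # seen only after

print(retained, lost, gained)

# The three sets partition the two windows:
# pre = retained + lost, and post5 = retained + gained.
print(retained | lost == pre, retained | gained == post5)
```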


### Validation of the number of collaborators retained and gained 

We shall now validate that the numbers we just calculated match the ones on which the matching was done.

In [15]:
# Merging
dfvalidation = df4[['MAGAID','RetractionYear','NumRetentionW5','NumNewCollaboratorsW5']].drop_duplicates().\
                    merge(df_treatment_control, on=['MAGAID','RetractionYear'])

dfvalidation

Unnamed: 0,MAGAID,RetractionYear,NumRetentionW5_x,NumNewCollaboratorsW5_x,AcademicAgeAtRetraction,NumRetentionW5_y,NumNewCollaboratorsW5_y,AuthorSeniorityAtRetraction
0,2.184860e+06,2008.0,6,5,2.0,6,5,early-career author
1,5.440459e+06,2010.0,13,35,12.0,13,35,senior author
2,8.197726e+06,2012.0,4,10,5.0,4,10,mid-career author
3,8.227037e+06,2003.0,6,20,2.0,6,20,early-career author
4,9.474215e+06,2015.0,55,481,22.0,55,481,senior author
...,...,...,...,...,...,...,...,...
4764,3.174448e+09,2008.0,1,0,2.0,1,0,early-career author
4765,3.174844e+09,2014.0,3,3,6.0,3,3,mid-career author
4766,3.175436e+09,2015.0,1,14,6.0,1,14,mid-career author
4767,3.176126e+09,2004.0,4,1,8.0,4,1,mid-career author


In [16]:
# Finally validating

dfvalidation[(dfvalidation.NumRetentionW5_x == dfvalidation.NumRetentionW5_y) & 
            (dfvalidation.NumNewCollaboratorsW5_x == dfvalidation.NumNewCollaboratorsW5_y)]

Unnamed: 0,MAGAID,RetractionYear,NumRetentionW5_x,NumNewCollaboratorsW5_x,AcademicAgeAtRetraction,NumRetentionW5_y,NumNewCollaboratorsW5_y,AuthorSeniorityAtRetraction
0,2.184860e+06,2008.0,6,5,2.0,6,5,early-career author
1,5.440459e+06,2010.0,13,35,12.0,13,35,senior author
2,8.197726e+06,2012.0,4,10,5.0,4,10,mid-career author
3,8.227037e+06,2003.0,6,20,2.0,6,20,early-career author
4,9.474215e+06,2015.0,55,481,22.0,55,481,senior author
...,...,...,...,...,...,...,...,...
4764,3.174448e+09,2008.0,1,0,2.0,1,0,early-career author
4765,3.174844e+09,2014.0,3,3,6.0,3,3,mid-career author
4766,3.175436e+09,2015.0,1,14,6.0,1,14,mid-career author
4767,3.176126e+09,2004.0,4,1,8.0,4,1,mid-career author


**Hence all of them are validated.**
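Eyeballing a filtered dataframe works, but an explicit assertion fails loudly if a count ever disagrees. Below is a minimal sketch; the helper `assert_counts_match` and the dataframe `toy_validation` are hypothetical names, and the toy values are copied from the first rows shown above.

```python
import pandas as pd

def assert_counts_match(validation,
                        cols=("NumRetentionW5", "NumNewCollaboratorsW5")):
    """Raise AssertionError if a recomputed count (_x) differs from the
    count stored in the matching files (_y)."""
    for col in cols:
        mismatches = validation[validation[f"{col}_x"] != validation[f"{col}_y"]]
        assert mismatches.empty, f"{len(mismatches)} rows differ in {col}"
    return True

# Toy stand-in for dfvalidation.
toy_validation = pd.DataFrame({
    "NumRetentionW5_x": [6, 13, 4],
    "NumRetentionW5_y": [6, 13, 4],
    "NumNewCollaboratorsW5_x": [5, 35, 10],
    "NumNewCollaboratorsW5_y": [5, 35, 10],
})

print(assert_counts_match(toy_validation))
```

Since `merge` performs an inner join by default, comparing `len(df4)` with `len(dfvalidation)` would additionally show whether any scientist dropped out during the merge.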

## Analysis

In [17]:
# Our main dataframes are df3 and df4
# Let us look at them first
print(df3.shape)
df3.head()

(575753, 25)


Unnamed: 0,MAGAID,MAGCollabAID,RetractionYear,MAGCollaborationYear,ScientistType,CollabGenderizeGender,CollabGenderizeConfidence,CollabMAGFirstPubYear,CollabMAGCumPapersYearAtRetraction,CollabMAGCumPapersAtRetraction,...,CollabMAGCumPapersAtCollaboration,CollabMAGCumCitationsYearAtCollaboration,CollabMAGCumCitationsAtCollaboration,CollabMAGCumCollaboratorsYearAtCollaboration,CollabMAGCumCollaboratorsAtCollaboration,AcademicAgeAtRetraction,AuthorSeniorityAtRetraction,CollabAcademicAgeAtRetraction,CollabAcademicAgeAtCollaboration,PrePostFlag5
0,2105038000.0,2004120834,1994.0,1983.0,retracted,male,0.99,1967.0,1994.0,104.0,...,47.0,1983.0,1076.0,1983.0,46,31.0,senior author,27.0,16.0,pre
1,2105038000.0,2124401064,1994.0,1983.0,retracted,male,0.74,1964.0,1994.0,78.0,...,47.0,1983.0,1305.0,1983.0,37,31.0,senior author,30.0,19.0,pre
2,2105038000.0,2486043001,1994.0,1983.0,retracted,male,0.6,1971.0,1983.0,10.0,...,10.0,1983.0,199.0,1983.0,18,31.0,senior author,23.0,12.0,pre
3,2105038000.0,2124401064,1994.0,1992.0,retracted,male,0.74,1964.0,1994.0,78.0,...,74.0,1992.0,2449.0,1992.0,71,31.0,senior author,30.0,28.0,pre
4,2105038000.0,2276877851,1994.0,1992.0,retracted,female,0.98,1984.0,1993.0,16.0,...,14.0,1992.0,90.0,1992.0,27,31.0,senior author,10.0,8.0,pre


In [18]:
df4

PrePostFlag5,MAGAID,RetractionYear,post5,pre,NumRetentionW5,CollabAIDRetainedW5,CollabAIDLostW5,NumNewCollaboratorsW5,CollabAIDGainedW5
0,2.184860e+06,2008.0,"{1749937409, 2149399999, 2617454923, 260192180...","{2118754826, 2601921804, 2665868049, 230519298...",6,"{2617454923, 2601921804, 2032600174, 230576087...","{1834047680, 2038079680, 258432864, 2182005956...",5,"{1749937409, 1517361100, 2798255860, 251012216..."
1,5.440459e+06,2010.0,"{2084557827, 2402421397, 2235672854, 213407784...","{2132186624, 2084557827, 3130870283, 222369435...",13,"{2680186915, 2084557827, 2144828135, 97975655,...","{2132186624, 3130870283, 2223694350, 215816449...",35,"{2134077845, 2402421397, 2235672854, 217199132..."
2,8.197726e+06,2012.0,"{1969204096, 2226225926, 2024673639, 199363863...","{2024673639, 279977545, 1970425162, 1689175372...",4,"{1970425162, 1689175372, 2122225613, 2024673639}","{279977545, 1778388943, 2506500849, 343947537,...",10,"{1969204096, 2226225926, 1993638631, 256742932..."
3,8.227037e+06,2003.0,"{2438245003, 2566236814, 2099366676, 186420623...","{2119475842, 2103663625, 3037713418, 276189799...",6,"{2698681252, 2464725413, 2135649512, 251608811...","{2119475842, 2103663625, 3037713418, 276189799...",20,"{2438245003, 1513362301, 2566236814, 209936667..."
4,9.474215e+06,2015.0,"{2741493764, 1134876678, 2589376522, 288503604...","{2102010368, 2147699201, 2402001923, 277873818...",55,"{2102010368, 2147699201, 2402001923, 211700557...","{2613431302, 1578585096, 2047308816, 216607540...",481,"{2741493764, 2589376522, 2885036046, 297136949..."
...,...,...,...,...,...,...,...,...,...
4764,3.174448e+09,2008.0,{2561941943},"{2413204075, 2100828844, 3175667245, 299211752...",1,{2561941943},"{2413204075, 2100828844, 3175667245, 299211752...",0,{}
4765,3.174844e+09,2014.0,"{2636262617, 550125002, 2295148299, 2265510894...","{2954067065, 2311908582, 2005715177, 550125002...",3,"{2174600848, 2636262617, 550125002}","{2311908582, 2005715177, 2124843274, 250398225...",3,"{1494968409, 2295148299, 2265510894}"
4766,3.175436e+09,2015.0,"{1805786912, 2999619457, 2658197410, 257933920...","{2130470407, 2395301650, 1455333013, 231293572...",1,{2121913688},"{2240552385, 2130470407, 2333910471, 252033530...",14,"{1805786912, 2999619457, 2658197410, 196858720..."
4767,3.176126e+09,2004.0,"{1986848736, 2166182598, 2098417261, 264898482...","{2509432581, 2132746631, 2105761291, 204507009...",4,"{1986848736, 2477984924, 2648984820, 2166182598}","{2780665794, 2117749316, 2509432581, 213274663...",1,{2098417261}


In [19]:
# Let us first merge df3 and df4

df_A = df3.merge(df4, on=['MAGAID','RetractionYear'])

# Let us also create three flags checking whether current collaborator is retained, gained, or lost

df_A['CollabAIDinRetained'] = df_A.apply(lambda row: row['MAGCollabAID'] in row['CollabAIDRetainedW5'], 
                                          axis=1)

df_A['CollabAIDinGained'] = df_A.apply(lambda row: row['MAGCollabAID'] in row['CollabAIDGainedW5'], 
                                          axis=1)

df_A['CollabAIDinLost'] = df_A.apply(lambda row: row['MAGCollabAID'] in row['CollabAIDLostW5'], 
                                          axis=1)

df_A.head(3)

Unnamed: 0,MAGAID,MAGCollabAID,RetractionYear,MAGCollaborationYear,ScientistType,CollabGenderizeGender,CollabGenderizeConfidence,CollabMAGFirstPubYear,CollabMAGCumPapersYearAtRetraction,CollabMAGCumPapersAtRetraction,...,post5,pre,NumRetentionW5,CollabAIDRetainedW5,CollabAIDLostW5,NumNewCollaboratorsW5,CollabAIDGainedW5,CollabAIDinRetained,CollabAIDinGained,CollabAIDinLost
0,2105038000.0,2004120834,1994.0,1983.0,retracted,male,0.99,1967.0,1994.0,104.0,...,"{2024377920, 2111173543, 2124401064, 213957751...","{2935888513, 2004120834, 3069320578, 231952499...",4,"{2124401064, 2139577513, 2308282282, 2276877851}","{2004120834, 2787959316, 2464403478, 263399605...",4,"{2024377920, 2432858298, 2317413108, 2111173543}",False,False,True
1,2105038000.0,2124401064,1994.0,1983.0,retracted,male,0.74,1964.0,1994.0,78.0,...,"{2024377920, 2111173543, 2124401064, 213957751...","{2935888513, 2004120834, 3069320578, 231952499...",4,"{2124401064, 2139577513, 2308282282, 2276877851}","{2004120834, 2787959316, 2464403478, 263399605...",4,"{2024377920, 2432858298, 2317413108, 2111173543}",True,False,False
2,2105038000.0,2486043001,1994.0,1983.0,retracted,male,0.6,1971.0,1983.0,10.0,...,"{2024377920, 2111173543, 2124401064, 213957751...","{2935888513, 2004120834, 3069320578, 231952499...",4,"{2124401064, 2139577513, 2308282282, 2276877851}","{2004120834, 2787959316, 2464403478, 263399605...",4,"{2024377920, 2432858298, 2317413108, 2111173543}",False,False,True


In [20]:
# Sanity checks

df_A[['CollabAIDinRetained','CollabAIDinGained','CollabAIDinLost']].value_counts()

CollabAIDinRetained  CollabAIDinGained  CollabAIDinLost
False                False              True               239869
                     True               False              186673
True                 False              False              149211
Name: count, dtype: int64
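Only three flag combinations occur, each with exactly one flag set, and their counts add up to 575,753, the number of rows in `df3`. The same property can be asserted directly, as in the sketch below on a made-up dataframe `toy_flags`.

```python
import pandas as pd

# Made-up flags: one retained, one gained, and one lost collaborator.
toy_flags = pd.DataFrame({
    "CollabAIDinRetained": [True, False, False],
    "CollabAIDinGained": [False, True, False],
    "CollabAIDinLost": [False, False, True],
})

# Booleans sum as 0/1, so this counts the flags set in each row.
flags_per_row = toy_flags.sum(axis=1)

# Every collaborator should fall in exactly one category.
assert (flags_per_row == 1).all()
print(flags_per_row.tolist())
```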

In [21]:
df_A.columns, df_A.shape

(Index(['MAGAID', 'MAGCollabAID', 'RetractionYear', 'MAGCollaborationYear',
        'ScientistType', 'CollabGenderizeGender', 'CollabGenderizeConfidence',
        'CollabMAGFirstPubYear', 'CollabMAGCumPapersYearAtRetraction',
        'CollabMAGCumPapersAtRetraction',
        'CollabMAGCumCitationsYearAtRetraction',
        'CollabMAGCumCitationsAtRetraction',
        'CollabMAGCumCollaboratorsYearAtRetraction',
        'CollabMAGCumCollaboratorsAtRetraction',
        'CollabMAGCumPapersYearAtCollaboration',
        'CollabMAGCumPapersAtCollaboration',
        'CollabMAGCumCitationsYearAtCollaboration',
        'CollabMAGCumCitationsAtCollaboration',
        'CollabMAGCumCollaboratorsYearAtCollaboration',
        'CollabMAGCumCollaboratorsAtCollaboration', 'AcademicAgeAtRetraction',
        'AuthorSeniorityAtRetraction', 'CollabAcademicAgeAtRetraction',
        'CollabAcademicAgeAtCollaboration', 'PrePostFlag5', 'post5', 'pre',
        'NumRetentionW5', 'CollabAIDRetainedW5', 'CollabAID

## DANGER ZONE!

The code in this section would remove collaborators whose academic age exceeds 70 at the time of collaboration. Both cells are currently commented out, so no rows are dropped.

In [22]:
#df_A[df_A.CollabAcademicAgeAtCollaboration.gt(70) & df_A.ScientistType.eq('retracted')].MAGAID.nunique()

In [23]:
#df_A = df_A[df_A.CollabAcademicAgeAtCollaboration.le(70)]

### A1: Collaborators retained: retracted vs. matched

In [24]:
# Let us keep only the rows of df_A with collaborations in the 5 years after retraction

df_A1_post = df_A[df_A['PrePostFlag5']=='post5']

In [25]:
# Now we group by MAGAID, MAGCollabAID, and RetractionYear
# and take the earliest post-retraction collaboration year of each pair

df_A1_firstcollabs = df_A1_post.groupby(['MAGAID','MAGCollabAID','RetractionYear'])['MAGCollaborationYear']\
                        .min().reset_index()\
                        .rename(columns={'MAGCollaborationYear':'FirstPostRetractionMAGCollaborationYear'})


# Now we shall merge the new column with A1

df_A1_w_firstcollabs = df_A1_post.merge(df_A1_firstcollabs,
                                   on=['MAGAID','MAGCollabAID','RetractionYear'])

df_A1_w_firstcollabs.shape

(252887, 36)

In [26]:
# Sanity checks

df_A1_w_firstcollabs.sort_values(by=['MAGAID','MAGCollabAID','MAGCollaborationYear'])\
            [['MAGAID','MAGCollabAID','MAGCollaborationYear','FirstPostRetractionMAGCollaborationYear']].head(30)

Unnamed: 0,MAGAID,MAGCollabAID,MAGCollaborationYear,FirstPostRetractionMAGCollaborationYear
134954,2184860.0,250648223,2009.0,2009.0
134955,2184860.0,1517361100,2009.0,2009.0
134956,2184860.0,1749937409,2009.0,2009.0
134957,2184860.0,2032600174,2009.0,2009.0
134958,2184860.0,2149399999,2009.0,2009.0
134951,2184860.0,2305760879,2009.0,2009.0
134959,2184860.0,2510122165,2009.0,2009.0
134960,2184860.0,2517094809,2009.0,2009.0
134952,2184860.0,2601921804,2009.0,2009.0
134953,2184860.0,2617454923,2009.0,2009.0


In [27]:
# Now let us only extract rows where collaboration year is the first collaboration year

df_A1_w_firstcollabs_only = df_A1_w_firstcollabs[df_A1_w_firstcollabs.MAGCollaborationYear == \
                                                df_A1_w_firstcollabs.FirstPostRetractionMAGCollaborationYear]

df_A1_w_firstcollabs_only.shape

(189364, 36)
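The group-merge-filter sequence above can also be written with `groupby(...).transform("min")`, which returns the group minimum aligned to the original rows. The sketch below uses a made-up dataframe `toy` without missing keys; it illustrates the idea and has not been checked against the real data.

```python
import pandas as pd

# Made-up collaborations: the pair (1, 10) has two post-retraction years.
toy = pd.DataFrame({
    "MAGAID": [1, 1, 1, 2],
    "MAGCollabAID": [10, 10, 11, 10],
    "RetractionYear": [2010, 2010, 2010, 2012],
    "MAGCollaborationYear": [2012, 2011, 2013, 2014],
})

keys = ["MAGAID", "MAGCollabAID", "RetractionYear"]

# Earliest collaboration year of each pair, repeated on every row of the pair.
first_year = toy.groupby(keys)["MAGCollaborationYear"].transform("min")

# Keep only the rows that fall in that earliest year.
first_only = toy[toy["MAGCollaborationYear"] == first_year]

print(first_only)
```

Note that several rows of the same pair can share the earliest year, in which case all of them are kept by either approach.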

In [28]:
df_A1_w_firstcollabs_only.columns

Index(['MAGAID', 'MAGCollabAID', 'RetractionYear', 'MAGCollaborationYear',
       'ScientistType', 'CollabGenderizeGender', 'CollabGenderizeConfidence',
       'CollabMAGFirstPubYear', 'CollabMAGCumPapersYearAtRetraction',
       'CollabMAGCumPapersAtRetraction',
       'CollabMAGCumCitationsYearAtRetraction',
       'CollabMAGCumCitationsAtRetraction',
       'CollabMAGCumCollaboratorsYearAtRetraction',
       'CollabMAGCumCollaboratorsAtRetraction',
       'CollabMAGCumPapersYearAtCollaboration',
       'CollabMAGCumPapersAtCollaboration',
       'CollabMAGCumCitationsYearAtCollaboration',
       'CollabMAGCumCitationsAtCollaboration',
       'CollabMAGCumCollaboratorsYearAtCollaboration',
       'CollabMAGCumCollaboratorsAtCollaboration', 'AcademicAgeAtRetraction',
       'AuthorSeniorityAtRetraction', 'CollabAcademicAgeAtRetraction',
       'CollabAcademicAgeAtCollaboration', 'PrePostFlag5', 'post5', 'pre',
       'NumRetentionW5', 'CollabAIDRetainedW5', 'CollabAIDLostW5',
       '

In [29]:
def create_stratified_dfs_retention(dfi):
    
    # This function creates the 6 dataframes needed for our analysis:
    # 3 for the treatment (retracted) group and 3 for the control (matched) group,
    # i.e. stratified by treatment/control and further by seniority.
    # Only the relevant columns and only retained collaborators are kept.
    df_ids = pd.read_csv(INDIR_MATCHING+"/RWMAG_rematched_control_augmented_rematching_30perc.csv",
                    usecols=['MAGAID','MatchMAGAID', 'RetractionYear']).drop_duplicates()
    
    rel_cols = ['MAGAID', 'ScientistType','MAGCollabAID', 'RetractionYear', 'MAGCollaborationYear',
               'CollabMAGCumPapersAtCollaboration', 'CollabMAGCumCitationsAtCollaboration',
               'CollabMAGCumCollaboratorsAtCollaboration', 'AuthorSeniorityAtRetraction',
               'CollabAcademicAgeAtCollaboration', 'CollabAIDinRetained', 'NumRetentionW5']
    
    # Only extracting relevant cols
    dfi = dfi[rel_cols].drop_duplicates()
    
    # Only extract those collaborators that were retained
    dfi = dfi[dfi['CollabAIDinRetained']]
    
    # Dividing into retracted and matched
    df_retracted = dfi[dfi.ScientistType == 'retracted']
    df_nonretracted = dfi[dfi.ScientistType == 'matched']\
                        .rename(columns={'MAGAID':'MatchMAGAID'})\
                        .merge(df_ids, on=['MatchMAGAID','RetractionYear'])
    
    # We also need to make sure that the retracted scientists have matches with non-zero collaborators
    df_retracted = df_retracted[df_retracted.MAGAID.isin(df_nonretracted.MAGAID.unique())]
    
    # We need to make sure that the matches of those who retained 0 collaborators are removed
    df_nonretracted = df_nonretracted[df_nonretracted.MAGAID.isin(df_retracted.MAGAID.unique())]
    
    # Dividing into seniority for retracted
    df_retracted_junior = df_retracted[df_retracted.AuthorSeniorityAtRetraction=='early-career author']
    df_retracted_midcareer = df_retracted[df_retracted.AuthorSeniorityAtRetraction=='mid-career author']
    df_retracted_senior = df_retracted[df_retracted.AuthorSeniorityAtRetraction=='senior author']
    # and matched
    df_nonretracted_junior = df_nonretracted[df_nonretracted.AuthorSeniorityAtRetraction=='early-career author']
    df_nonretracted_midcareer = df_nonretracted[df_nonretracted.AuthorSeniorityAtRetraction=='mid-career author']
    df_nonretracted_senior = df_nonretracted[df_nonretracted.AuthorSeniorityAtRetraction=='senior author']
    
    return df_retracted_junior, df_retracted_midcareer, df_retracted_senior, df_nonretracted_junior, df_nonretracted_midcareer, df_nonretracted_senior


In [30]:
df_rj, df_rm, df_rs, df_nrj, df_nrm, df_nrs = create_stratified_dfs_retention(df_A1_w_firstcollabs_only)

In [31]:
df_rj.MAGAID.nunique(), df_rm.MAGAID.nunique(), df_rs.MAGAID.nunique(), df_nrj.MAGAID.nunique(), df_nrm.MAGAID.nunique(), df_nrs.MAGAID.nunique()

(457, 284, 402, 457, 284, 402)

In [32]:
# Let us extract the mean dataframes and merge them for different age categories

def get_mean_df_retention(dfr, dfnr):
    mean_dfr = dfr.groupby('MAGAID')[['CollabAcademicAgeAtCollaboration',
                      'CollabMAGCumPapersAtCollaboration',
                      'CollabMAGCumCitationsAtCollaboration',
                      'CollabMAGCumCollaboratorsAtCollaboration']].mean()
    
    mean_dfnr = dfnr.groupby(['MAGAID','MatchMAGAID'])[['CollabAcademicAgeAtCollaboration',
                      'CollabMAGCumPapersAtCollaboration',
                      'CollabMAGCumCitationsAtCollaboration',
                      'CollabMAGCumCollaboratorsAtCollaboration']].mean()\
                    .groupby('MAGAID').mean()\
                    .rename(columns={'CollabMAGCumPapersAtCollaboration':'MatchCollabMAGCumPapersAtCollaboration',
                                    'CollabAcademicAgeAtCollaboration':'MatchCollabAcademicAgeAtCollaboration',
                                    'CollabMAGCumCitationsAtCollaboration': 'MatchCollabMAGCumCitationsAtCollaboration',
                                    'CollabMAGCumCollaboratorsAtCollaboration': 'MatchCollabMAGCumCollaboratorsAtCollaboration'})
    
    return mean_dfr, mean_dfnr

def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), stats.sem(a)
    h = se * stats.t.ppf((1 + confidence) / 2., n-1)
    return m, m-h, m+h

def get_stats(dfr, dfnr, column):
    """
    This code will compute the mean, median, std dev. and p-value (as per welch test), and CIs for
    the given column
    """
    
    mean_r = dfr[column].mean()
    median_r = dfr[column].median()
    std_r = dfr[column].std()
    
    mean_nr = dfnr['Match'+column].mean()
    median_nr = dfnr['Match'+column].median()
    std_nr = dfnr['Match'+column].std()
    
    _, pval = stats.ttest_ind(dfr[column],dfnr['Match'+column], equal_var=False)
    
    lst_delta = (dfr[column]- dfnr['Match'+column]).tolist()
    delta_mean, conf_lower, conf_upper = mean_confidence_interval(lst_delta, confidence=0.95)
    
    return {column+'_retracted_mean':round(mean_r,2), 
            column+'_retracted_median':round(median_r,2), 
            column+'_retracted_std':round(std_r,2), 
            column+'_nonretracted_mean':round(mean_nr,2), 
            column+'_nonretracted_median':round(median_nr,2), 
            column+'_nonretracted_std':round(std_nr,2), 
            column+'_delta_mean':round(delta_mean,2), 
            column+'_pval_welch':round(pval,3), 
            column+'_CI_95lower':round(conf_lower,2), 
            column+'_CI_95upper':round(conf_upper,2)}
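
The p-value and the confidence interval returned by `get_stats` come from two different procedures. The p-value is from Welch's unequal-variance t-test on the two columns, whereas the interval is a t-based interval around the mean of the index-aligned differences. Here is a minimal sketch of both on synthetic data. It assumes that `stats` in the cell above refers to `scipy.stats`; the numbers and variable names are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic per-scientist means, indexed by a hypothetical MAGAID
rng = np.random.default_rng(0)
ids = np.arange(100)
retracted = pd.Series(rng.normal(12.0, 8.0, size=100), index=ids)
matched = pd.Series(rng.normal(14.0, 7.0, size=100), index=ids)

# Welch's t-test: treats the two samples as independent with unequal variances
_, pval = stats.ttest_ind(retracted, matched, equal_var=False)

# t-based 95% CI for the mean of the pairwise (index-aligned) differences
delta = (retracted - matched).to_numpy()
n = len(delta)
m, se = delta.mean(), stats.sem(delta)
h = se * stats.t.ppf((1 + 0.95) / 2.0, n - 1)
ci_lower, ci_upper = m - h, m + h

print(round(pval, 3), round(m, 2), round(ci_lower, 2), round(ci_upper, 2))
```

Because the interval uses the pairing and the test does not, the two can disagree. The Mid column of the academic-age retention table is an example: the Welch p-value is 0.07 while the interval (-2.06, -0.02) excludes zero.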

In [33]:
# Now let us do the comparison

# Let us first get the mean dataframes

mean_dfrj, mean_dfnrj = get_mean_df_retention(df_rj, df_nrj)
mean_dfrm, mean_dfnrm = get_mean_df_retention(df_rm, df_nrm)
mean_dfrs, mean_dfnrs = get_mean_df_retention(df_rs, df_nrs)

mean_dfnrj

MAGAID,MatchCollabAcademicAgeAtCollaboration,MatchCollabMAGCumPapersAtCollaboration,MatchCollabMAGCumCitationsAtCollaboration,MatchCollabMAGCumCollaboratorsAtCollaboration
2184860,4.250000,11.8750,39.000,39.125000
33433812,15.333333,91.0000,924.000,76.333333
63736147,4.200000,14.6000,45.200,22.400000
77871054,18.600000,130.6000,680.600,308.000000
94287040,7.600000,120.4000,1556.800,142.200000
...,...,...,...,...
3046204276,10.600000,69.8000,1802.000,148.200000
3049602676,21.000000,87.0000,285.000,24.000000
3052744139,8.812500,110.1875,898.125,123.250000
3061499449,5.000000,3.0000,3.000,9.000000
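
A retracted scientist can have several matches, so the matched means above are computed in two stages: first over the collaborators of each match, then over the matches of each retracted scientist. The toy illustration below (IDs and values are hypothetical) shows how this differs from pooling all collaborators of all matches.

```python
import pandas as pd

# Hypothetical data: retracted scientist 1 has two matches (10 and 11).
# Match 10 has three collaborators, match 11 has one.
df = pd.DataFrame({
    'MAGAID':      [1, 1, 1, 1],
    'MatchMAGAID': [10, 10, 10, 11],
    'CollabAcademicAgeAtCollaboration': [2.0, 4.0, 6.0, 20.0],
})

# Stage 1: mean per (retracted scientist, match); stage 2: mean over matches
two_stage = (df.groupby(['MAGAID', 'MatchMAGAID'])['CollabAcademicAgeAtCollaboration']
               .mean()
               .groupby(level='MAGAID').mean())

# Pooled alternative: every collaborator weighs equally, so large teams dominate
pooled = df.groupby('MAGAID')['CollabAcademicAgeAtCollaboration'].mean()

print(two_stage.loc[1])  # (4.0 + 20.0) / 2 = 12.0
print(pooled.loc[1])     # (2 + 4 + 6 + 20) / 4 = 8.0
```

With the two-stage mean every match contributes equally, regardless of how many collaborators it has.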


In [34]:
exp_fields = ['CollabAcademicAgeAtCollaboration',
                      'CollabMAGCumPapersAtCollaboration',
                      'CollabMAGCumCitationsAtCollaboration',
                      'CollabMAGCumCollaboratorsAtCollaboration']

# Now we compute the outcome variables for each of the four experience variables.

lst_dicts_retention = []

for exp_field in exp_fields:
    dicts_retention = {}
    
    dict_stats_j = get_stats(mean_dfrj, mean_dfnrj, exp_field)
    dict_stats_m = get_stats(mean_dfrm, mean_dfnrm, exp_field)
    dict_stats_s = get_stats(mean_dfrs, mean_dfnrs, exp_field)
    
    dicts_retention['Junior'] = dict_stats_j
    dicts_retention['Mid'] = dict_stats_m
    dicts_retention['Senior'] = dict_stats_s
    
    lst_dicts_retention.append(dicts_retention)

In [35]:
pd.DataFrame(lst_dicts_retention[0])

Unnamed: 0,Junior,Mid,Senior
CollabAcademicAgeAtCollaboration_retracted_mean,12.26,14.69,15.71
CollabAcademicAgeAtCollaboration_retracted_median,11.33,14.62,15.91
CollabAcademicAgeAtCollaboration_retracted_std,8.55,6.81,5.93
CollabAcademicAgeAtCollaboration_nonretracted_mean,14.01,15.73,16.79
CollabAcademicAgeAtCollaboration_nonretracted_median,13.12,15.24,15.85
CollabAcademicAgeAtCollaboration_nonretracted_std,6.91,6.89,7.14
CollabAcademicAgeAtCollaboration_delta_mean,-1.75,-1.04,-1.07
CollabAcademicAgeAtCollaboration_pval_welch,0.001,0.07,0.02
CollabAcademicAgeAtCollaboration_CI_95lower,-2.68,-2.06,-1.92
CollabAcademicAgeAtCollaboration_CI_95upper,-0.82,-0.02,-0.23


In [36]:
pd.DataFrame(lst_dicts_retention[1])

Unnamed: 0,Junior,Mid,Senior
CollabMAGCumPapersAtCollaboration_retracted_mean,65.43,80.72,69.83
CollabMAGCumPapersAtCollaboration_retracted_median,43.75,59.06,61.0
CollabMAGCumPapersAtCollaboration_retracted_std,71.35,103.82,56.88
CollabMAGCumPapersAtCollaboration_nonretracted_mean,77.13,85.26,74.28
CollabMAGCumPapersAtCollaboration_nonretracted_median,60.0,64.11,57.81
CollabMAGCumPapersAtCollaboration_nonretracted_std,92.06,83.75,66.04
CollabMAGCumPapersAtCollaboration_delta_mean,-11.7,-4.54,-4.45
CollabMAGCumPapersAtCollaboration_pval_welch,0.032,0.566,0.306
CollabMAGCumPapersAtCollaboration_CI_95lower,-21.89,-19.05,-12.77
CollabMAGCumPapersAtCollaboration_CI_95upper,-1.51,9.97,3.87


In [37]:
pd.DataFrame(lst_dicts_retention[2])

Unnamed: 0,Junior,Mid,Senior
CollabMAGCumCitationsAtCollaboration_retracted_mean,1295.48,1537.61,1657.42
CollabMAGCumCitationsAtCollaboration_retracted_median,458.75,796.29,866.66
CollabMAGCumCitationsAtCollaboration_retracted_std,2334.67,2111.18,4262.91
CollabMAGCumCitationsAtCollaboration_nonretracted_mean,1467.54,1809.91,1507.48
CollabMAGCumCitationsAtCollaboration_nonretracted_median,776.79,857.97,760.78
CollabMAGCumCitationsAtCollaboration_nonretracted_std,2439.67,3249.68,2277.16
CollabMAGCumCitationsAtCollaboration_delta_mean,-172.06,-272.31,149.95
CollabMAGCumCitationsAtCollaboration_pval_welch,0.276,0.237,0.534
CollabMAGCumCitationsAtCollaboration_CI_95lower,-464.65,-676.16,-307.24
CollabMAGCumCitationsAtCollaboration_CI_95upper,120.52,131.54,607.14


In [38]:
pd.DataFrame(lst_dicts_retention[3])

Unnamed: 0,Junior,Mid,Senior
CollabMAGCumCollaboratorsAtCollaboration_retracted_mean,156.08,209.88,204.57
CollabMAGCumCollaboratorsAtCollaboration_retracted_median,89.2,121.15,137.65
CollabMAGCumCollaboratorsAtCollaboration_retracted_std,220.26,313.91,256.89
CollabMAGCumCollaboratorsAtCollaboration_nonretracted_mean,198.13,202.18,192.76
CollabMAGCumCollaboratorsAtCollaboration_nonretracted_median,113.8,123.92,129.49
CollabMAGCumCollaboratorsAtCollaboration_nonretracted_std,342.34,238.46,342.27
CollabMAGCumCollaboratorsAtCollaboration_delta_mean,-42.05,7.7,11.81
CollabMAGCumCollaboratorsAtCollaboration_pval_welch,0.028,0.742,0.58
CollabMAGCumCollaboratorsAtCollaboration_CI_95lower,-79.3,-34.34,-28.58
CollabMAGCumCollaboratorsAtCollaboration_CI_95upper,-4.79,49.73,52.21


### A2: Collaborators gained: retracted vs. matched

In [39]:
# Let us now restrict df_A to post-retraction collaborations, i.e. remove all rows with collaborations pre-retraction

df_A2_post = df_A[df_A['PrePostFlag5']=='post5']

In [40]:
# Now we shall group by MAGAID, MAGCollabAID, and RetractionYear,
# and extract the earliest collaboration year post retraction for each group

df_A2_firstcollabs = df_A2_post.groupby(['MAGAID','MAGCollabAID','RetractionYear'])['MAGCollaborationYear']\
                        .min().reset_index()\
                        .rename(columns={'MAGCollaborationYear':'FirstPostRetractionMAGCollaborationYear'})


# Now we shall merge the new column with A2

df_A2_w_firstcollabs = df_A2_post.merge(df_A2_firstcollabs,
                                   on=['MAGAID','MAGCollabAID','RetractionYear'])

df_A2_w_firstcollabs.shape

(252887, 36)

In [41]:
# Sanity checks

df_A2_w_firstcollabs.sort_values(by=['MAGAID','MAGCollabAID','MAGCollaborationYear'])\
            [['MAGAID','MAGCollabAID','MAGCollaborationYear','FirstPostRetractionMAGCollaborationYear']].head(30)

Unnamed: 0,MAGAID,MAGCollabAID,MAGCollaborationYear,FirstPostRetractionMAGCollaborationYear
134954,2184860.0,250648223,2009.0,2009.0
134955,2184860.0,1517361100,2009.0,2009.0
134956,2184860.0,1749937409,2009.0,2009.0
134957,2184860.0,2032600174,2009.0,2009.0
134958,2184860.0,2149399999,2009.0,2009.0
134951,2184860.0,2305760879,2009.0,2009.0
134959,2184860.0,2510122165,2009.0,2009.0
134960,2184860.0,2517094809,2009.0,2009.0
134952,2184860.0,2601921804,2009.0,2009.0
134953,2184860.0,2617454923,2009.0,2009.0


In [42]:
# Now let us only extract rows where collaboration year is the first collaboration year

df_A2_w_firstcollabs_only = df_A2_w_firstcollabs[df_A2_w_firstcollabs.MAGCollaborationYear == \
                                                df_A2_w_firstcollabs.FirstPostRetractionMAGCollaborationYear]

df_A2_w_firstcollabs_only.shape

(189364, 36)
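
The three cells above (group by, merge, filter) keep only the rows from the first post-retraction collaboration year of each scientist-collaborator pair. A small self-contained sketch of the same pattern, using hypothetical records rather than the real `df_A`:

```python
import pandas as pd

# Hypothetical post-retraction collaboration records
df_post = pd.DataFrame({
    'MAGAID':               [1, 1, 1, 1],
    'MAGCollabAID':         [7, 7, 8, 8],
    'RetractionYear':       [2005, 2005, 2005, 2005],
    'MAGCollaborationYear': [2006, 2008, 2007, 2007],
})
keys = ['MAGAID', 'MAGCollabAID', 'RetractionYear']

# Earliest post-retraction collaboration year per (scientist, collaborator) pair
first = (df_post.groupby(keys)['MAGCollaborationYear'].min().reset_index()
                .rename(columns={'MAGCollaborationYear': 'FirstPostRetractionMAGCollaborationYear'}))

# Attach it to every row, then keep only the rows from that first year
merged = df_post.merge(first, on=keys)
first_only = merged[merged.MAGCollaborationYear == merged.FirstPostRetractionMAGCollaborationYear]

print(first_only[['MAGCollabAID', 'MAGCollaborationYear']].values.tolist())
```

In this toy data the 2008 record for collaborator 7 is dropped, while both 2007 records for collaborator 8 survive: a pair with several records in its first year keeps all of them until duplicates are dropped later.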

In [43]:
df_A2_w_firstcollabs_only.columns

Index(['MAGAID', 'MAGCollabAID', 'RetractionYear', 'MAGCollaborationYear',
       'ScientistType', 'CollabGenderizeGender', 'CollabGenderizeConfidence',
       'CollabMAGFirstPubYear', 'CollabMAGCumPapersYearAtRetraction',
       'CollabMAGCumPapersAtRetraction',
       'CollabMAGCumCitationsYearAtRetraction',
       'CollabMAGCumCitationsAtRetraction',
       'CollabMAGCumCollaboratorsYearAtRetraction',
       'CollabMAGCumCollaboratorsAtRetraction',
       'CollabMAGCumPapersYearAtCollaboration',
       'CollabMAGCumPapersAtCollaboration',
       'CollabMAGCumCitationsYearAtCollaboration',
       'CollabMAGCumCitationsAtCollaboration',
       'CollabMAGCumCollaboratorsYearAtCollaboration',
       'CollabMAGCumCollaboratorsAtCollaboration', 'AcademicAgeAtRetraction',
       'AuthorSeniorityAtRetraction', 'CollabAcademicAgeAtRetraction',
       'CollabAcademicAgeAtCollaboration', 'PrePostFlag5', 'post5', 'pre',
       'NumRetentionW5', 'CollabAIDRetainedW5', 'CollabAIDLostW5',
       '

In [44]:
def create_stratified_dfs_gain(dfi):
    
    # This function will create 6 dataframes relevant for conducting our analysis
    # 3 of those dataframes will be for relevant columns for treatment
    # rest 3 will be average control. 
    # These will be stratified by treatment and control, and further stratified by seniority
    df_ids = pd.read_csv(INDIR_MATCHING+"/RWMAG_rematched_control_augmented_rematching_30perc.csv",
                    usecols=['MAGAID','MatchMAGAID', 'RetractionYear']).drop_duplicates()
    
    rel_cols = ['MAGAID', 'ScientistType','MAGCollabAID', 'RetractionYear', 'MAGCollaborationYear',
               'CollabMAGCumPapersAtCollaboration', 'CollabMAGCumCitationsAtCollaboration',
               'CollabMAGCumCollaboratorsAtCollaboration', 'AuthorSeniorityAtRetraction',
               'CollabAcademicAgeAtCollaboration', 'CollabAIDinGained', 'NumNewCollaboratorsW5']
    
    # Only extracting relevant cols
    dfi = dfi[rel_cols].drop_duplicates()
    
    # Only extract those collaborators that were gained
    dfi = dfi[dfi['CollabAIDinGained']]
    
    # Dividing into retracted and matched
    df_retracted = dfi[dfi.ScientistType == 'retracted']
    df_nonretracted = dfi[dfi.ScientistType == 'matched']\
                        .rename(columns={'MAGAID':'MatchMAGAID'})\
                        .merge(df_ids, on=['MatchMAGAID','RetractionYear'])
    
    # We also need to make sure that the retracted scientists have matches with a non-zero number of gained collaborators
    df_retracted = df_retracted[df_retracted.MAGAID.isin(df_nonretracted.MAGAID.unique())]
    
    # We need to make sure that the matches of those who gained 0 collaborators are removed
    df_nonretracted = df_nonretracted[df_nonretracted.MAGAID.isin(df_retracted.MAGAID.unique())]
    
    # Dividing into seniority for retracted
    df_retracted_junior = df_retracted[df_retracted.AuthorSeniorityAtRetraction=='early-career author']
    df_retracted_midcareer = df_retracted[df_retracted.AuthorSeniorityAtRetraction=='mid-career author']
    df_retracted_senior = df_retracted[df_retracted.AuthorSeniorityAtRetraction=='senior author']
    # and matched
    df_nonretracted_junior = df_nonretracted[df_nonretracted.AuthorSeniorityAtRetraction=='early-career author']
    df_nonretracted_midcareer = df_nonretracted[df_nonretracted.AuthorSeniorityAtRetraction=='mid-career author']
    df_nonretracted_senior = df_nonretracted[df_nonretracted.AuthorSeniorityAtRetraction=='senior author']
    
    return df_retracted_junior, df_retracted_midcareer, df_retracted_senior,df_nonretracted_junior, df_nonretracted_midcareer, df_nonretracted_senior
    

In [45]:
df_rj, df_rm, df_rs, df_nrj, df_nrm, df_nrs = create_stratified_dfs_gain(df_A2_w_firstcollabs_only)

In [46]:
df_rj.MAGAID.nunique(), df_rm.MAGAID.nunique(), df_rs.MAGAID.nunique(), df_nrj.MAGAID.nunique(), df_nrm.MAGAID.nunique(), df_nrs.MAGAID.nunique()

(487, 292, 408, 487, 292, 408)
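
The retracted and matched counts are identical within each seniority group because `create_stratified_dfs_gain` applies two `isin` filters: a retracted scientist is kept only if at least one of their matches has a gained collaborator, and a match is kept only if its retracted scientist survives that filter. A toy sketch of this mutual filtering (all IDs are hypothetical):

```python
import pandas as pd

# Retracted scientists 1, 2 and 3 each have at least one gained collaborator
df_retracted = pd.DataFrame({'MAGAID': [1, 1, 2, 3], 'MAGCollabAID': [7, 8, 9, 10]})

# Matched rows, already keyed by the retracted scientist's MAGAID after the merge
# with the matching table; only the matches of 1 and 4 gained a collaborator
df_nonretracted = pd.DataFrame({'MAGAID': [1, 4], 'MatchMAGAID': [100, 400],
                                'MAGCollabAID': [11, 12]})

# Keep retracted scientists that have a match with data, then the reverse
df_retracted = df_retracted[df_retracted.MAGAID.isin(df_nonretracted.MAGAID.unique())]
df_nonretracted = df_nonretracted[df_nonretracted.MAGAID.isin(df_retracted.MAGAID.unique())]

print(df_retracted.MAGAID.nunique(), df_nonretracted.MAGAID.nunique())
```

Only scientist 1 appears on both sides, so both frames end up with the same single `MAGAID`.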

In [47]:
# Let us extract the mean dataframes and merge them for different age categories

def get_mean_df_gain(dfr, dfnr):
    mean_dfr = dfr.groupby('MAGAID')[['CollabAcademicAgeAtCollaboration',
                      'CollabMAGCumPapersAtCollaboration',
                      'CollabMAGCumCitationsAtCollaboration',
                      'CollabMAGCumCollaboratorsAtCollaboration']].mean()
    
    mean_dfnr = dfnr.groupby(['MAGAID','MatchMAGAID'])[['CollabAcademicAgeAtCollaboration',
                      'CollabMAGCumPapersAtCollaboration',
                      'CollabMAGCumCitationsAtCollaboration',
                      'CollabMAGCumCollaboratorsAtCollaboration']].mean()\
                    .groupby('MAGAID').mean()\
                    .rename(columns={'CollabMAGCumPapersAtCollaboration':'MatchCollabMAGCumPapersAtCollaboration',
                                    'CollabAcademicAgeAtCollaboration':'MatchCollabAcademicAgeAtCollaboration',
                                    'CollabMAGCumCitationsAtCollaboration': 'MatchCollabMAGCumCitationsAtCollaboration',
                                    'CollabMAGCumCollaboratorsAtCollaboration': 'MatchCollabMAGCumCollaboratorsAtCollaboration'})
    
    return mean_dfr, mean_dfnr

def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), stats.sem(a)
    h = se * stats.t.ppf((1 + confidence) / 2., n-1)
    return m, m-h, m+h

def get_stats(dfr, dfnr, column):
    """
    This code will compute the mean, median, std dev. and p-value (as per welch test), and CIs for
    the given column
    """
    
    mean_r = dfr[column].mean()
    median_r = dfr[column].median()
    std_r = dfr[column].std()
    
    mean_nr = dfnr['Match'+column].mean()
    median_nr = dfnr['Match'+column].median()
    std_nr = dfnr['Match'+column].std()
    
    _, pval = stats.ttest_ind(dfr[column],dfnr['Match'+column], equal_var=False)
    
    lst_delta = (dfr[column]- dfnr['Match'+column]).tolist()
    delta_mean, conf_lower, conf_upper = mean_confidence_interval(lst_delta, confidence=0.95)
    
    return {column+'_retracted_mean':round(mean_r,2), 
            column+'_retracted_median':round(median_r,2), 
            column+'_retracted_std':round(std_r,2), 
            column+'_nonretracted_mean':round(mean_nr,2), 
            column+'_nonretracted_median':round(median_nr,2), 
            column+'_nonretracted_std':round(std_nr,2), 
            column+'_delta_mean':round(delta_mean,2), 
            column+'_pval_welch':round(pval,3), 
            column+'_CI_95lower':round(conf_lower,2), 
            column+'_CI_95upper':round(conf_upper,2)}

In [48]:
# Now let us do the comparison

# Let us first get the mean dataframes

mean_dfrj, mean_dfnrj = get_mean_df_gain(df_rj, df_nrj)
mean_dfrm, mean_dfnrm = get_mean_df_gain(df_rm, df_nrm)
mean_dfrs, mean_dfnrs = get_mean_df_gain(df_rs, df_nrs)

mean_dfnrj

MAGAID,MatchCollabAcademicAgeAtCollaboration,MatchCollabMAGCumPapersAtCollaboration,MatchCollabMAGCumCitationsAtCollaboration,MatchCollabMAGCumCollaboratorsAtCollaboration
2184860,6.400000,7.800000,137.400000,29.400000
21686935,43.000000,97.000000,1729.000000,223.000000
53426528,8.833333,37.625000,1336.916667,110.791667
62104001,9.000000,4.000000,22.000000,10.000000
63736147,0.000000,1.000000,0.000000,3.750000
...,...,...,...,...
3046015184,10.931034,47.275862,2790.189655,286.379310
3049602676,0.000000,1.000000,0.000000,3.000000
3052744139,6.987405,53.289624,1168.337350,245.023625
3061499449,5.500000,2.500000,1.000000,8.500000


In [49]:
exp_fields = ['CollabAcademicAgeAtCollaboration',
                      'CollabMAGCumPapersAtCollaboration',
                      'CollabMAGCumCitationsAtCollaboration',
                      'CollabMAGCumCollaboratorsAtCollaboration']

# Now we compute the outcome variables for each of the four experience variables.

lst_dicts_gain = []

for exp_field in exp_fields:
    dicts_gain = {}
    
    dict_stats_j = get_stats(mean_dfrj, mean_dfnrj, exp_field)
    dict_stats_m = get_stats(mean_dfrm, mean_dfnrm, exp_field)
    dict_stats_s = get_stats(mean_dfrs, mean_dfnrs, exp_field)
    
    dicts_gain['Junior'] = dict_stats_j
    dicts_gain['Mid'] = dict_stats_m
    dicts_gain['Senior'] = dict_stats_s
    
    lst_dicts_gain.append(dicts_gain)

In [50]:
pd.DataFrame(lst_dicts_gain[0])

Unnamed: 0,Junior,Mid,Senior
CollabAcademicAgeAtCollaboration_retracted_mean,7.16,8.06,8.95
CollabAcademicAgeAtCollaboration_retracted_median,5.93,7.78,8.74
CollabAcademicAgeAtCollaboration_retracted_std,6.86,5.13,4.56
CollabAcademicAgeAtCollaboration_nonretracted_mean,7.86,7.95,8.72
CollabAcademicAgeAtCollaboration_nonretracted_median,7.59,7.57,8.49
CollabAcademicAgeAtCollaboration_nonretracted_std,5.35,4.9,4.58
CollabAcademicAgeAtCollaboration_delta_mean,-0.7,0.12,0.23
CollabAcademicAgeAtCollaboration_pval_welch,0.077,0.775,0.479
CollabAcademicAgeAtCollaboration_CI_95lower,-1.42,-0.63,-0.38
CollabAcademicAgeAtCollaboration_CI_95upper,0.02,0.87,0.83


In [51]:
pd.DataFrame(lst_dicts_gain[1])

Unnamed: 0,Junior,Mid,Senior
CollabMAGCumPapersAtCollaboration_retracted_mean,36.21,39.7,41.12
CollabMAGCumPapersAtCollaboration_retracted_median,20.49,29.24,34.32
CollabMAGCumPapersAtCollaboration_retracted_std,52.73,47.32,31.48
CollabMAGCumPapersAtCollaboration_nonretracted_mean,35.6,35.77,35.84
CollabMAGCumPapersAtCollaboration_nonretracted_median,28.83,27.57,29.51
CollabMAGCumPapersAtCollaboration_nonretracted_std,31.86,39.86,29.12
CollabMAGCumPapersAtCollaboration_delta_mean,0.61,3.93,5.28
CollabMAGCumPapersAtCollaboration_pval_welch,0.828,0.279,0.013
CollabMAGCumPapersAtCollaboration_CI_95lower,-4.81,-2.92,1.44
CollabMAGCumPapersAtCollaboration_CI_95upper,6.02,10.78,9.12


In [52]:
pd.DataFrame(lst_dicts_gain[2])

Unnamed: 0,Junior,Mid,Senior
CollabMAGCumCitationsAtCollaboration_retracted_mean,871.58,886.13,924.74
CollabMAGCumCitationsAtCollaboration_retracted_median,246.0,396.1,569.13
CollabMAGCumCitationsAtCollaboration_retracted_std,1775.04,1575.87,1047.52
CollabMAGCumCitationsAtCollaboration_nonretracted_mean,822.8,788.61,702.8
CollabMAGCumCitationsAtCollaboration_nonretracted_median,399.92,302.32,398.92
CollabMAGCumCitationsAtCollaboration_nonretracted_std,1199.4,1536.14,1001.46
CollabMAGCumCitationsAtCollaboration_delta_mean,48.78,97.52,221.93
CollabMAGCumCitationsAtCollaboration_pval_welch,0.615,0.449,0.002
CollabMAGCumCitationsAtCollaboration_CI_95lower,-136.24,-141.8,91.31
CollabMAGCumCitationsAtCollaboration_CI_95upper,233.81,336.84,352.56


In [53]:
pd.DataFrame(lst_dicts_gain[3])

Unnamed: 0,Junior,Mid,Senior
CollabMAGCumCollaboratorsAtCollaboration_retracted_mean,110.98,198.32,162.99
CollabMAGCumCollaboratorsAtCollaboration_retracted_median,46.25,69.85,96.76
CollabMAGCumCollaboratorsAtCollaboration_retracted_std,232.24,1139.99,256.03
CollabMAGCumCollaboratorsAtCollaboration_nonretracted_mean,122.35,110.32,113.69
CollabMAGCumCollaboratorsAtCollaboration_nonretracted_median,68.64,56.86,73.19
CollabMAGCumCollaboratorsAtCollaboration_nonretracted_std,180.67,185.77,130.94
CollabMAGCumCollaboratorsAtCollaboration_delta_mean,-11.37,88.0,49.29
CollabMAGCumCollaboratorsAtCollaboration_pval_welch,0.394,0.194,0.001
CollabMAGCumCollaboratorsAtCollaboration_CI_95lower,-37.27,-45.21,22.63
CollabMAGCumCollaboratorsAtCollaboration_CI_95upper,14.52,221.21,75.96


In [54]:
def create_latex_for_filling(dicto, col):
    """
    Print the LaTeX rows (mean, median, std, Welch p-value) for one experience variable.
    Each row has six cells: retracted and matched for Junior, Mid, and Senior.
    """
    
    def create_string(metric):
        cells = []
        for group in ['Junior', 'Mid', 'Senior']:
            if metric == 'pval_welch':
                # There is one p-value per group; it is repeated to fill both cells
                cells += [dicto.get(group).get(col+"_"+metric)] * 2
            else:
                cells += [dicto.get(group).get(col+"_retracted_"+metric),
                          dicto.get(group).get(col+"_nonretracted_"+metric)]
        
        # "\\\\" is the escaped LaTeX row terminator (two backslashes)
        return "& " + " & ".join(str(cell) for cell in cells) + "\\\\ \n"
    
    print(create_string("mean"))
    print(create_string("median"))
    print(create_string("std"))
    print(create_string("pval_welch"))
    
# pd.DataFrame(lst_dicts_retention[0])

In [55]:
for i in range(len(lst_dicts_retention)):
    dicto_retention = lst_dicts_retention[i]
    col = exp_fields[i]
    print(col)
    create_latex_for_filling(dicto_retention, col)

CollabAcademicAgeAtCollaboration
& 12.26 & 14.01 & 14.69 & 15.73 & 15.71 & 16.79\\ 

& 11.33 & 13.12 & 14.62 & 15.24 & 15.91 & 15.85\\ 

& 8.55 & 6.91 & 6.81 & 6.89 & 5.93 & 7.14\\ 

& 0.001 & 0.001 & 0.07 & 0.07 & 0.02 & 0.02\\ 

CollabMAGCumPapersAtCollaboration
& 65.43 & 77.13 & 80.72 & 85.26 & 69.83 & 74.28\\ 

& 43.75 & 60.0 & 59.06 & 64.11 & 61.0 & 57.81\\ 

& 71.35 & 92.06 & 103.82 & 83.75 & 56.88 & 66.04\\ 

& 0.032 & 0.032 & 0.566 & 0.566 & 0.306 & 0.306\\ 

CollabMAGCumCitationsAtCollaboration
& 1295.48 & 1467.54 & 1537.61 & 1809.91 & 1657.42 & 1507.48\\ 

& 458.75 & 776.79 & 796.29 & 857.97 & 866.66 & 760.78\\ 

& 2334.67 & 2439.67 & 2111.18 & 3249.68 & 4262.91 & 2277.16\\ 

& 0.276 & 0.276 & 0.237 & 0.237 & 0.534 & 0.534\\ 

CollabMAGCumCollaboratorsAtCollaboration
& 156.08 & 198.13 & 209.88 & 202.18 & 204.57 & 192.76\\ 

& 89.2 & 113.8 & 121.15 & 123.92 & 137.65 & 129.49\\ 

& 220.26 & 342.34 & 313.91 & 238.46 & 256.89 & 342.27\\ 

& 0.028 & 0.028 & 0.742 & 0.742 & 0.58 & 

In [56]:
for i in range(len(lst_dicts_gain)):
    dicto_gain = lst_dicts_gain[i]
    col = exp_fields[i]
    print(col)
    create_latex_for_filling(dicto_gain, col)

CollabAcademicAgeAtCollaboration
& 7.16 & 7.86 & 8.06 & 7.95 & 8.95 & 8.72\\ 

& 5.93 & 7.59 & 7.78 & 7.57 & 8.74 & 8.49\\ 

& 6.86 & 5.35 & 5.13 & 4.9 & 4.56 & 4.58\\ 

& 0.077 & 0.077 & 0.775 & 0.775 & 0.479 & 0.479\\ 

CollabMAGCumPapersAtCollaboration
& 36.21 & 35.6 & 39.7 & 35.77 & 41.12 & 35.84\\ 

& 20.49 & 28.83 & 29.24 & 27.57 & 34.32 & 29.51\\ 

& 52.73 & 31.86 & 47.32 & 39.86 & 31.48 & 29.12\\ 

& 0.828 & 0.828 & 0.279 & 0.279 & 0.013 & 0.013\\ 

CollabMAGCumCitationsAtCollaboration
& 871.58 & 822.8 & 886.13 & 788.61 & 924.74 & 702.8\\ 

& 246.0 & 399.92 & 396.1 & 302.32 & 569.13 & 398.92\\ 

& 1775.04 & 1199.4 & 1575.87 & 1536.14 & 1047.52 & 1001.46\\ 

& 0.615 & 0.615 & 0.449 & 0.449 & 0.002 & 0.002\\ 

CollabMAGCumCollaboratorsAtCollaboration
& 110.98 & 122.35 & 198.32 & 110.32 & 162.99 & 113.69\\ 

& 46.25 & 68.64 & 69.85 & 56.86 & 96.76 & 73.19\\ 

& 232.24 & 180.67 & 1139.99 & 185.77 & 256.03 & 130.94\\ 

& 0.394 & 0.394 & 0.194 & 0.194 & 0.001 & 0.001\\ 



In [57]:
dicto_retention

{'Junior': {'CollabMAGCumCollaboratorsAtCollaboration_retracted_mean': 156.08,
  'CollabMAGCumCollaboratorsAtCollaboration_retracted_median': 89.2,
  'CollabMAGCumCollaboratorsAtCollaboration_retracted_std': 220.26,
  'CollabMAGCumCollaboratorsAtCollaboration_nonretracted_mean': 198.13,
  'CollabMAGCumCollaboratorsAtCollaboration_nonretracted_median': 113.8,
  'CollabMAGCumCollaboratorsAtCollaboration_nonretracted_std': 342.34,
  'CollabMAGCumCollaboratorsAtCollaboration_delta_mean': -42.05,
  'CollabMAGCumCollaboratorsAtCollaboration_pval_welch': 0.028,
  'CollabMAGCumCollaboratorsAtCollaboration_CI_95lower': -79.3,
  'CollabMAGCumCollaboratorsAtCollaboration_CI_95upper': -4.79},
 'Mid': {'CollabMAGCumCollaboratorsAtCollaboration_retracted_mean': 209.88,
  'CollabMAGCumCollaboratorsAtCollaboration_retracted_median': 121.15,
  'CollabMAGCumCollaboratorsAtCollaboration_retracted_std': 313.91,
  'CollabMAGCumCollaboratorsAtCollaboration_nonretracted_mean': 202.18,
  'CollabMAGCumCollabor

In [58]:
dicto_gain

{'Junior': {'CollabMAGCumCollaboratorsAtCollaboration_retracted_mean': 110.98,
  'CollabMAGCumCollaboratorsAtCollaboration_retracted_median': 46.25,
  'CollabMAGCumCollaboratorsAtCollaboration_retracted_std': 232.24,
  'CollabMAGCumCollaboratorsAtCollaboration_nonretracted_mean': 122.35,
  'CollabMAGCumCollaboratorsAtCollaboration_nonretracted_median': 68.64,
  'CollabMAGCumCollaboratorsAtCollaboration_nonretracted_std': 180.67,
  'CollabMAGCumCollaboratorsAtCollaboration_delta_mean': -11.37,
  'CollabMAGCumCollaboratorsAtCollaboration_pval_welch': 0.394,
  'CollabMAGCumCollaboratorsAtCollaboration_CI_95lower': -37.27,
  'CollabMAGCumCollaboratorsAtCollaboration_CI_95upper': 14.52},
 'Mid': {'CollabMAGCumCollaboratorsAtCollaboration_retracted_mean': 198.32,
  'CollabMAGCumCollaboratorsAtCollaboration_retracted_median': 69.85,
  'CollabMAGCumCollaboratorsAtCollaboration_retracted_std': 1139.99,
  'CollabMAGCumCollaboratorsAtCollaboration_nonretracted_mean': 110.32,
  'CollabMAGCumCollab

### A3: Collaborators retained vs lost: retracted vs. matched
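
As stated at the top of the notebook, this analysis uses a difference-in-differences (DiD) approach: retracted minus matched is computed separately for retained and for lost collaborators, and then RETAINED-LOST is taken. The sketch below is one plausible reading of that plan on hypothetical per-scientist means; the notebook's actual implementation may differ in its details.

```python
import pandas as pd

# Hypothetical per-scientist mean academic age of collaborators at retraction,
# indexed by the retracted scientist's MAGAID
retracted_retained = pd.Series([10.0, 12.0, 14.0], index=[1, 2, 3])
matched_retained   = pd.Series([11.0, 12.0, 16.0], index=[1, 2, 3])
retracted_lost     = pd.Series([8.0, 9.0, 10.0], index=[1, 2, 3])
matched_lost       = pd.Series([8.0, 10.0, 12.0], index=[1, 2, 3])

# First differences: retracted minus matched, separately for retained and lost
diff_retained = retracted_retained - matched_retained
diff_lost = retracted_lost - matched_lost

# Difference in differences, RETAINED - LOST, per scientist and on average
did = diff_retained - diff_lost
print(did.tolist())   # [-1.0, 1.0, 0.0]
print(did.mean())     # 0.0
```

Computing the DiD per scientist requires that all four groups share the same set of `MAGAID` values, so that the subtractions align on the index.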

In [59]:
# Let us now split df_A into post-retraction and pre-retraction collaborations

df_A3_post = df_A[df_A['PrePostFlag5']=='post5']
df_A3_pre = df_A[df_A['PrePostFlag5']=='pre']

In [60]:
# Now we shall group by MAGAID, MAGCollabAID, and RetractionYear,
# and extract the earliest collaboration year post retraction for each group

df_A3_firstcollabs = df_A3_post.groupby(['MAGAID','MAGCollabAID','RetractionYear'])['MAGCollaborationYear']\
                        .min().reset_index()\
                        .rename(columns={'MAGCollaborationYear':'FirstPostRetractionMAGCollaborationYear'})


# Now we shall merge the new column with A3

df_A3_w_firstcollabs = df_A3_post.merge(df_A3_firstcollabs,
                                   on=['MAGAID','MAGCollabAID','RetractionYear'])

df_A3_w_firstcollabs.shape


(252887, 36)

In [61]:
# Sanity checks

df_A3_w_firstcollabs.sort_values(by=['MAGAID','MAGCollabAID','MAGCollaborationYear'])\
            [['MAGAID','MAGCollabAID','MAGCollaborationYear','FirstPostRetractionMAGCollaborationYear']].head(30)

Unnamed: 0,MAGAID,MAGCollabAID,MAGCollaborationYear,FirstPostRetractionMAGCollaborationYear
134954,2184860.0,250648223,2009.0,2009.0
134955,2184860.0,1517361100,2009.0,2009.0
134956,2184860.0,1749937409,2009.0,2009.0
134957,2184860.0,2032600174,2009.0,2009.0
134958,2184860.0,2149399999,2009.0,2009.0
134951,2184860.0,2305760879,2009.0,2009.0
134959,2184860.0,2510122165,2009.0,2009.0
134960,2184860.0,2517094809,2009.0,2009.0
134952,2184860.0,2601921804,2009.0,2009.0
134953,2184860.0,2617454923,2009.0,2009.0


In [62]:
# Now let us only extract rows where collaboration year is the first collaboration year

df_A3_w_firstcollabs_only = df_A3_w_firstcollabs[df_A3_w_firstcollabs.MAGCollaborationYear == \
                                                df_A3_w_firstcollabs.FirstPostRetractionMAGCollaborationYear]

df_A3_w_firstcollabs_only.shape

(189364, 36)

In [63]:
df_A3_w_firstcollabs_only.columns

Index(['MAGAID', 'MAGCollabAID', 'RetractionYear', 'MAGCollaborationYear',
       'ScientistType', 'CollabGenderizeGender', 'CollabGenderizeConfidence',
       'CollabMAGFirstPubYear', 'CollabMAGCumPapersYearAtRetraction',
       'CollabMAGCumPapersAtRetraction',
       'CollabMAGCumCitationsYearAtRetraction',
       'CollabMAGCumCitationsAtRetraction',
       'CollabMAGCumCollaboratorsYearAtRetraction',
       'CollabMAGCumCollaboratorsAtRetraction',
       'CollabMAGCumPapersYearAtCollaboration',
       'CollabMAGCumPapersAtCollaboration',
       'CollabMAGCumCitationsYearAtCollaboration',
       'CollabMAGCumCitationsAtCollaboration',
       'CollabMAGCumCollaboratorsYearAtCollaboration',
       'CollabMAGCumCollaboratorsAtCollaboration', 'AcademicAgeAtRetraction',
       'AuthorSeniorityAtRetraction', 'CollabAcademicAgeAtRetraction',
       'CollabAcademicAgeAtCollaboration', 'PrePostFlag5', 'post5', 'pre',
       'NumRetentionW5', 'CollabAIDRetainedW5', 'CollabAIDLostW5',
       '

In [64]:
# Finally let us merge post and pre

df_A3_post_pre = pd.concat([df_A3_w_firstcollabs_only,df_A3_pre])

df_A3_post_pre.head()

Unnamed: 0,MAGAID,MAGCollabAID,RetractionYear,MAGCollaborationYear,ScientistType,CollabGenderizeGender,CollabGenderizeConfidence,CollabMAGFirstPubYear,CollabMAGCumPapersYearAtRetraction,CollabMAGCumPapersAtRetraction,...,pre,NumRetentionW5,CollabAIDRetainedW5,CollabAIDLostW5,NumNewCollaboratorsW5,CollabAIDGainedW5,CollabAIDinRetained,CollabAIDinGained,CollabAIDinLost,FirstPostRetractionMAGCollaborationYear
0,2105038000.0,2024377920,1994.0,1999.0,retracted,male,0.99,1992.0,1994.0,3.0,...,"{2935888513, 2004120834, 3069320578, 231952499...",4,"{2124401064, 2139577513, 2308282282, 2276877851}","{2004120834, 2787959316, 2464403478, 263399605...",4,"{2024377920, 2432858298, 2317413108, 2111173543}",False,True,False,1999.0
1,2105038000.0,2111173543,1994.0,1999.0,retracted,male,0.99,1979.0,1994.0,41.0,...,"{2935888513, 2004120834, 3069320578, 231952499...",4,"{2124401064, 2139577513, 2308282282, 2276877851}","{2004120834, 2787959316, 2464403478, 263399605...",4,"{2024377920, 2432858298, 2317413108, 2111173543}",False,True,False,1999.0
2,2105038000.0,2317413108,1994.0,1999.0,retracted,male,0.99,1998.0,,0.0,...,"{2935888513, 2004120834, 3069320578, 231952499...",4,"{2124401064, 2139577513, 2308282282, 2276877851}","{2004120834, 2787959316, 2464403478, 263399605...",4,"{2024377920, 2432858298, 2317413108, 2111173543}",False,True,False,1999.0
3,2105038000.0,2432858298,1994.0,1999.0,retracted,male,0.86,1962.0,1994.0,67.0,...,"{2935888513, 2004120834, 3069320578, 231952499...",4,"{2124401064, 2139577513, 2308282282, 2276877851}","{2004120834, 2787959316, 2464403478, 263399605...",4,"{2024377920, 2432858298, 2317413108, 2111173543}",False,True,False,1999.0
4,2105038000.0,2124401064,1994.0,1996.0,retracted,male,0.74,1964.0,1994.0,78.0,...,"{2935888513, 2004120834, 3069320578, 231952499...",4,"{2124401064, 2139577513, 2308282282, 2276877851}","{2004120834, 2787959316, 2464403478, 263399605...",4,"{2024377920, 2432858298, 2317413108, 2111173543}",True,False,False,1996.0


In [65]:
def create_stratified_dfs_a3(dfi):
    
    # This function will create 6 dataframes relevant for conducting our analysis
    # 3 of those dataframes will be for relevant columns for treatment
    # rest 3 will be average control. 
    # These will be stratified by treatment and control, and further stratified by seniority
    df_ids = pd.read_csv(INDIR_MATCHING+"/RWMAG_rematched_control_augmented_rematching_30perc.csv",
                    usecols=['MAGAID','MatchMAGAID', 'RetractionYear']).drop_duplicates()
    
    rel_cols = ['MAGAID', 'ScientistType','MAGCollabAID', 'RetractionYear',
               'CollabMAGCumPapersAtRetraction', 'CollabMAGCumCitationsAtRetraction',
               'CollabMAGCumCollaboratorsAtRetraction', 'AuthorSeniorityAtRetraction',
               'CollabAcademicAgeAtRetraction', 'CollabAIDinRetained', 'CollabAIDinLost']
    
    # Only extracting relevant cols
    dfi = dfi[rel_cols].drop_duplicates()
    
    # Split collaborators into those that were retained and those that were lost
    dfi_retained = dfi[dfi['CollabAIDinRetained']]
    dfi_lost = dfi[dfi['CollabAIDinLost']]
    
    # Dividing into retracted and matched
    df_retracted_retained = dfi_retained[dfi_retained.ScientistType == 'retracted']
    df_retracted_lost = dfi_lost[dfi_lost.ScientistType == 'retracted']
    
    df_nonretracted_retained = dfi_retained[dfi_retained.ScientistType == 'matched']\
                        .rename(columns={'MAGAID':'MatchMAGAID'})\
                        .merge(df_ids, on=['MatchMAGAID','RetractionYear'])
    
    df_nonretracted_lost = dfi_lost[dfi_lost.ScientistType == 'matched']\
                        .rename(columns={'MAGAID':'MatchMAGAID'})\
                        .merge(df_ids, on=['MatchMAGAID','RetractionYear'])
    
    # We also need to make sure that the four groups (retracted/matched x retained/lost) cover the same MAGAIDs
    
    set1 = set(df_retracted_retained['MAGAID'].unique())
    set2 = set(df_retracted_lost['MAGAID'].unique())
    set3 = set(df_nonretracted_retained['MAGAID'].unique())
    set4 = set(df_nonretracted_lost['MAGAID'].unique())
    
    magaids_intersection = set1.intersection(set2, set3, set4)
    
    df_retracted_retained = df_retracted_retained[df_retracted_retained.MAGAID.isin(magaids_intersection)]
    df_retracted_lost = df_retracted_lost[df_retracted_lost.MAGAID.isin(magaids_intersection)]
    df_nonretracted_retained = df_nonretracted_retained[df_nonretracted_retained.MAGAID.isin(magaids_intersection)]
    df_nonretracted_lost = df_nonretracted_lost[df_nonretracted_lost.MAGAID.isin(magaids_intersection)]

    
    # Dividing by seniority for retracted: retained (_r) and lost (_l)
    
    dfrj_r = df_retracted_retained[df_retracted_retained.AuthorSeniorityAtRetraction=='early-career author']
    dfrm_r = df_retracted_retained[df_retracted_retained.AuthorSeniorityAtRetraction=='mid-career author']
    dfrs_r = df_retracted_retained[df_retracted_retained.AuthorSeniorityAtRetraction=='senior author']
    
    dfrj_l = df_retracted_lost[df_retracted_lost.AuthorSeniorityAtRetraction=='early-career author']
    dfrm_l = df_retracted_lost[df_retracted_lost.AuthorSeniorityAtRetraction=='mid-career author']
    dfrs_l = df_retracted_lost[df_retracted_lost.AuthorSeniorityAtRetraction=='senior author']
    
    # and matched
    dfnrj_r = df_nonretracted_retained[df_nonretracted_retained.AuthorSeniorityAtRetraction=='early-career author']
    dfnrm_r = df_nonretracted_retained[df_nonretracted_retained.AuthorSeniorityAtRetraction=='mid-career author']
    dfnrs_r = df_nonretracted_retained[df_nonretracted_retained.AuthorSeniorityAtRetraction=='senior author']
    
    dfnrj_l = df_nonretracted_lost[df_nonretracted_lost.AuthorSeniorityAtRetraction=='early-career author']
    dfnrm_l = df_nonretracted_lost[df_nonretracted_lost.AuthorSeniorityAtRetraction=='mid-career author']
    dfnrs_l = df_nonretracted_lost[df_nonretracted_lost.AuthorSeniorityAtRetraction=='senior author']
    
    return [dfrj_r,dfrm_r,dfrs_r,dfrj_l,dfrm_l,dfrs_l,dfnrj_r,dfnrm_r,dfnrs_r,dfnrj_l,dfnrm_l,dfnrs_l]
    

In [66]:
lst_stratified_dfs = create_stratified_dfs_a3(df_A3_post_pre)

for dfj in lst_stratified_dfs:
    print(dfj.MAGAID.nunique())
    
dfrj_r,dfrm_r,dfrs_r,dfrj_l,dfrm_l,dfrs_l,dfnrj_r,dfnrm_r,dfnrs_r,dfnrj_l,dfnrm_l,dfnrs_l = lst_stratified_dfs

428
282
402
428
282
402
428
282
402
428
282
402
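
The identical counts across the four groups within each seniority level follow from the ID intersection step in `create_stratified_dfs_a3`. A minimal sketch of that step on toy data is given below; the IDs and group names are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-ins for the four groups (hypothetical IDs, not real MAGAIDs)
groups = {
    "retracted_retained": pd.DataFrame({"MAGAID": [1, 2, 3, 3]}),
    "retracted_lost": pd.DataFrame({"MAGAID": [2, 3, 4]}),
    "matched_retained": pd.DataFrame({"MAGAID": [1, 2, 3]}),
    "matched_lost": pd.DataFrame({"MAGAID": [2, 3, 5]}),
}

# IDs that appear in every one of the four groups
common_ids = set.intersection(*(set(df["MAGAID"]) for df in groups.values()))

# Keep only rows whose MAGAID is present in all four groups
filtered = {name: df[df["MAGAID"].isin(common_ids)] for name, df in groups.items()}

# After filtering, every group covers the same number of distinct scientists
counts = {name: df["MAGAID"].nunique() for name, df in filtered.items()}
print(sorted(common_ids), counts)
```

Scientists that lack either retained or lost collaborators, on either the retracted or the matched side, drop out of all four groups at this point.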


In [67]:
# Let us extract the mean dataframes and merge them for different age categories

from scipy import stats  # needed below for stats.sem, stats.t.ppf and stats.ttest_ind

def get_mean_df_a3(dfr, dfnr):
    mean_dfr = dfr.groupby('MAGAID')[['CollabAcademicAgeAtRetraction',
                      'CollabMAGCumPapersAtRetraction',
                      'CollabMAGCumCitationsAtRetraction',
                      'CollabMAGCumCollaboratorsAtRetraction']].mean()
    
    mean_dfnr = dfnr.groupby(['MAGAID','MatchMAGAID'])[['CollabAcademicAgeAtRetraction',
                      'CollabMAGCumPapersAtRetraction',
                      'CollabMAGCumCitationsAtRetraction',
                      'CollabMAGCumCollaboratorsAtRetraction']].mean()\
                    .groupby('MAGAID').mean()\
                    .rename(columns={'CollabMAGCumPapersAtRetraction':'MatchCollabMAGCumPapersAtRetraction',
                                    'CollabAcademicAgeAtRetraction':'MatchCollabAcademicAgeAtRetraction',
                                    'CollabMAGCumCitationsAtRetraction': 'MatchCollabMAGCumCitationsAtRetraction',
                                    'CollabMAGCumCollaboratorsAtRetraction': 'MatchCollabMAGCumCollaboratorsAtRetraction'})
    
    return mean_dfr, mean_dfnr

def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), stats.sem(a)
    h = se * stats.t.ppf((1 + confidence) / 2., n-1)
    return m, m-h, m+h

def get_stats(dfr, dfnr, column):
    """
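    Compute the mean, median and standard deviation of `column` for retracted
    and matched scientists, the p-value of Welch's t-test between the two groups,
    and the mean and 95% CI of the pairwise difference (retracted minus matched).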
    """
    
    mean_r = dfr[column].mean()
    median_r = dfr[column].median()
    std_r = dfr[column].std()
    
    mean_nr = dfnr['Match'+column].mean()
    median_nr = dfnr['Match'+column].median()
    std_nr = dfnr['Match'+column].std()
    
    _, pval = stats.ttest_ind(dfr[column],dfnr['Match'+column], equal_var=False)
    
    lst_delta = (dfr[column]- dfnr['Match'+column]).tolist()
    delta_mean, conf_lower, conf_upper = mean_confidence_interval(lst_delta, confidence=0.95)
    
    return {column+'_retracted_mean':round(mean_r,2), 
            column+'_retracted_median':round(median_r,2), 
            column+'_retracted_std':round(std_r,2), 
            column+'_nonretracted_mean':round(mean_nr,2), 
            column+'_nonretracted_median':round(median_nr,2), 
            column+'_nonretracted_std':round(std_nr,2), 
            column+'_delta_mean':round(delta_mean,2), 
            column+'_pval_welch':round(pval,3), 
            column+'_CI_95lower':round(conf_lower,2), 
            column+'_CI_95upper':round(conf_upper,2)}
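
Note that `get_stats` reports a Welch p-value, which treats the two samples as independent, next to a confidence interval built on the pairwise differences. The two need not agree. The sketch below illustrates this on made-up numbers; `stats.ttest_rel` is not used in the notebook and appears here only for comparison.

```python
import numpy as np
from scipy import stats

# Hypothetical per-scientist values, aligned pair by pair
# (retracted[i] is matched with matched[i])
retracted = np.array([3.0, 5.0, 2.0, 6.0, 4.0, 7.0])
matched = np.array([4.0, 6.5, 2.5, 7.0, 5.5, 8.0])

# Unpaired comparison, as in get_stats: Welch's t-test ignores the pairing
_, p_welch = stats.ttest_ind(retracted, matched, equal_var=False)

# Paired comparison (for illustration only): t-test on within-pair differences
_, p_paired = stats.ttest_rel(retracted, matched)

# 95% CI of the mean paired difference, same formula as mean_confidence_interval
delta = retracted - matched
h = stats.sem(delta) * stats.t.ppf(0.975, len(delta) - 1)
ci = (delta.mean() - h, delta.mean() + h)

print(round(p_welch, 3), round(p_paired, 3), ci)
```

In this toy example the paired interval excludes zero while the Welch test does not reach the 0.05 level, because the between-scientist spread is large relative to the within-pair differences.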

In [68]:
# Now let us do the comparison

# Let us first get the mean dataframes

mean_dfrj_r, mean_dfnrj_r = get_mean_df_a3(dfrj_r, dfnrj_r)
mean_dfrj_l, mean_dfnrj_l = get_mean_df_a3(dfrj_l, dfnrj_l)

mean_dfrm_r, mean_dfnrm_r = get_mean_df_a3(dfrm_r, dfnrm_r)
mean_dfrm_l, mean_dfnrm_l = get_mean_df_a3(dfrm_l, dfnrm_l)

mean_dfrs_r, mean_dfnrs_r = get_mean_df_a3(dfrs_r, dfnrs_r)
mean_dfrs_l, mean_dfnrs_l = get_mean_df_a3(dfrs_l, dfnrs_l)


# Now let us compute differences

def compute_diff_df(df_ri, df_li, scientistType='retracted'):
    
    dfrli = df_ri.merge(df_li, right_index=True, left_index=True)
    
    if scientistType == 'matched':
        
        dfrli['MatchDiffAcademicAgeAtRetraction'] = dfrli['MatchCollabAcademicAgeAtRetraction_x'] - \
                                                dfrli['MatchCollabAcademicAgeAtRetraction_y']
        
        dfrli['MatchDiffMAGCumPapersAtRetraction'] = dfrli['MatchCollabMAGCumPapersAtRetraction_x'] - \
                                                dfrli['MatchCollabMAGCumPapersAtRetraction_y']
        
        dfrli['MatchDiffMAGCumCitationsAtRetraction'] = dfrli['MatchCollabMAGCumCitationsAtRetraction_x'] - \
                                                dfrli['MatchCollabMAGCumCitationsAtRetraction_y']
        
        dfrli['MatchDiffMAGCumCollaboratorsAtRetraction'] = dfrli['MatchCollabMAGCumCollaboratorsAtRetraction_x'] - \
                                                dfrli['MatchCollabMAGCumCollaboratorsAtRetraction_y']
        
        return dfrli
    
        
    dfrli['DiffAcademicAgeAtRetraction'] = dfrli['CollabAcademicAgeAtRetraction_x'] - \
                                            dfrli['CollabAcademicAgeAtRetraction_y']

    dfrli['DiffMAGCumPapersAtRetraction'] = dfrli['CollabMAGCumPapersAtRetraction_x'] - \
                                            dfrli['CollabMAGCumPapersAtRetraction_y']

    dfrli['DiffMAGCumCitationsAtRetraction'] = dfrli['CollabMAGCumCitationsAtRetraction_x'] - \
                                            dfrli['CollabMAGCumCitationsAtRetraction_y']

    dfrli['DiffMAGCumCollaboratorsAtRetraction'] = dfrli['CollabMAGCumCollaboratorsAtRetraction_x'] - \
                                            dfrli['CollabMAGCumCollaboratorsAtRetraction_y']

    return dfrli
    


In [69]:
dfrj_rMinusl = compute_diff_df(mean_dfrj_r, mean_dfrj_l)
dfnrj_rMinusl = compute_diff_df(mean_dfnrj_r, mean_dfnrj_l, scientistType='matched')


dfrm_rMinusl = compute_diff_df(mean_dfrm_r, mean_dfrm_l)
dfnrm_rMinusl = compute_diff_df(mean_dfnrm_r, mean_dfnrm_l, scientistType='matched')

dfrs_rMinusl = compute_diff_df(mean_dfrs_r, mean_dfrs_l)
dfnrs_rMinusl = compute_diff_df(mean_dfnrs_r, mean_dfnrs_l, scientistType='matched')
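
The two-step differencing can be sketched on toy data as follows. The column names `retained` and `lost` and all values are hypothetical; the notebook works on the `_x`/`_y` columns produced by the merge instead.

```python
import pandas as pd

# Hypothetical per-scientist mean collaborator paper counts, indexed by MAGAID
idx = pd.Index([101, 102, 103], name="MAGAID")
retracted = pd.DataFrame({"retained": [30.0, 42.0, 25.0],
                          "lost": [20.0, 30.0, 22.0]}, index=idx)
matched = pd.DataFrame({"retained": [35.0, 40.0, 31.0],
                        "lost": [18.0, 25.0, 24.0]}, index=idx)

# First difference: retained minus lost, within each scientist type
diff_retracted = retracted["retained"] - retracted["lost"]
diff_matched = matched["retained"] - matched["lost"]

# Second difference: retracted minus matched, aligned on MAGAID
did = diff_retracted - diff_matched

print(did.to_dict(), did.mean())
```

Taking retained minus lost first and then retracted minus matched gives the same value as differencing in the other order, since both equal (retained_retracted - lost_retracted) - (retained_matched - lost_matched).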

In [70]:
exp_fields = ['DiffAcademicAgeAtRetraction',
              'DiffMAGCumPapersAtRetraction',
              'DiffMAGCumCitationsAtRetraction',
              'DiffMAGCumCollaboratorsAtRetraction']

# Now we compute the outcome statistics for each of the four experience variables.

lst_dicts_a3 = []

for exp_field in exp_fields:
    dicts_a3 = {}
    
    dict_stats_j = get_stats(dfrj_rMinusl, dfnrj_rMinusl, exp_field)
    dict_stats_m = get_stats(dfrm_rMinusl, dfnrm_rMinusl, exp_field)
    dict_stats_s = get_stats(dfrs_rMinusl, dfnrs_rMinusl, exp_field)
    
    dicts_a3['Junior'] = dict_stats_j
    dicts_a3['Mid'] = dict_stats_m
    dicts_a3['Senior'] = dict_stats_s
    
    lst_dicts_a3.append(dicts_a3)

In [71]:
pd.DataFrame(lst_dicts_a3[0])

Unnamed: 0,Junior,Mid,Senior
DiffAcademicAgeAtRetraction_retracted_mean,3.31,3.11,-0.14
DiffAcademicAgeAtRetraction_retracted_median,2.45,2.31,0.14
DiffAcademicAgeAtRetraction_retracted_std,8.31,6.19,5.3
DiffAcademicAgeAtRetraction_nonretracted_mean,3.85,3.45,0.45
DiffAcademicAgeAtRetraction_nonretracted_median,3.18,3.22,0.18
DiffAcademicAgeAtRetraction_nonretracted_std,6.81,6.57,6.7
DiffAcademicAgeAtRetraction_delta_mean,-0.55,-0.33,-0.59
DiffAcademicAgeAtRetraction_pval_welch,0.292,0.537,0.164
DiffAcademicAgeAtRetraction_CI_95lower,-1.53,-1.3,-1.42
DiffAcademicAgeAtRetraction_CI_95upper,0.44,0.63,0.24


In [72]:
pd.DataFrame(lst_dicts_a3[1])

Unnamed: 0,Junior,Mid,Senior
DiffMAGCumPapersAtRetraction_retracted_mean,23.28,29.47,15.35
DiffMAGCumPapersAtRetraction_retracted_median,10.95,13.36,9.73
DiffMAGCumPapersAtRetraction_retracted_std,64.26,90.05,40.7
DiffMAGCumPapersAtRetraction_nonretracted_mean,28.55,32.91,22.6
DiffMAGCumPapersAtRetraction_nonretracted_median,17.21,22.07,11.3
DiffMAGCumPapersAtRetraction_nonretracted_std,80.73,66.44,54.4
DiffMAGCumPapersAtRetraction_delta_mean,-5.27,-3.44,-7.26
DiffMAGCumPapersAtRetraction_pval_welch,0.291,0.606,0.033
DiffMAGCumPapersAtRetraction_CI_95lower,-14.65,-15.78,-14.19
DiffMAGCumPapersAtRetraction_CI_95upper,4.12,8.91,-0.32


In [73]:
pd.DataFrame(lst_dicts_a3[2])

Unnamed: 0,Junior,Mid,Senior
DiffMAGCumCitationsAtRetraction_retracted_mean,301.29,230.42,191.99
DiffMAGCumCitationsAtRetraction_retracted_median,52.54,32.45,-0.25
DiffMAGCumCitationsAtRetraction_retracted_std,2005.05,1577.27,2289.88
DiffMAGCumCitationsAtRetraction_nonretracted_mean,449.46,552.37,274.36
DiffMAGCumCitationsAtRetraction_nonretracted_median,145.96,135.07,-8.65
DiffMAGCumCitationsAtRetraction_nonretracted_std,1842.0,2083.55,1703.39
DiffMAGCumCitationsAtRetraction_delta_mean,-148.18,-321.95,-82.37
DiffMAGCumCitationsAtRetraction_pval_welch,0.261,0.039,0.563
DiffMAGCumCitationsAtRetraction_CI_95lower,-407.12,-621.27,-375.08
DiffMAGCumCitationsAtRetraction_CI_95upper,110.77,-22.62,210.35


In [74]:
pd.DataFrame(lst_dicts_a3[3])

Unnamed: 0,Junior,Mid,Senior
DiffMAGCumCollaboratorsAtRetraction_retracted_mean,37.81,57.14,31.26
DiffMAGCumCollaboratorsAtRetraction_retracted_median,17.58,16.11,9.07
DiffMAGCumCollaboratorsAtRetraction_retracted_std,198.99,240.47,205.6
DiffMAGCumCollaboratorsAtRetraction_nonretracted_mean,51.86,62.8,34.17
DiffMAGCumCollaboratorsAtRetraction_nonretracted_median,32.33,35.94,18.02
DiffMAGCumCollaboratorsAtRetraction_nonretracted_std,344.57,162.51,140.63
DiffMAGCumCollaboratorsAtRetraction_delta_mean,-14.05,-5.66,-2.91
DiffMAGCumCollaboratorsAtRetraction_pval_welch,0.465,0.743,0.815
DiffMAGCumCollaboratorsAtRetraction_CI_95lower,-52.68,-37.89,-27.01
DiffMAGCumCollaboratorsAtRetraction_CI_95upper,24.58,26.57,21.18


In [75]:
def create_latex_for_filling(dicto, col):
    
    def create_string(metric):
        string = ""
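        # The p-value is shared by the retracted and matched columns of a group,
        # so it is written twice per group to fill both table cells
        if metric == 'pval_welch':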
            string = "& " + \
                str(dicto.get('Junior').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Junior').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Mid').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Mid').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Senior').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Senior').get(col+"_"+metric)) + \
                "\\\\ \n"
        else:
            string = "& " + \
                    str(dicto.get('Junior').get(col+"_retracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Junior').get(col+"_nonretracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Mid').get(col+"_retracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Mid').get(col+"_nonretracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Senior').get(col+"_retracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Senior').get(col+"_nonretracted_"+metric)) + \
                    "\\\\ \n"
        
        
        
        return string
    
    print(create_string("mean"))
    print(create_string("median"))
    print(create_string("std"))
    print(create_string("pval_welch"))



for i in range(len(lst_dicts_a3)):
    dicto_did = lst_dicts_a3[i]
    col = exp_fields[i]
    print(col)
    create_latex_for_filling(dicto_did, col)

DiffAcademicAgeAtRetraction
& 3.31 & 3.85 & 3.11 & 3.45 & -0.14 & 0.45\\ 

& 2.45 & 3.18 & 2.31 & 3.22 & 0.14 & 0.18\\ 

& 8.31 & 6.81 & 6.19 & 6.57 & 5.3 & 6.7\\ 

& 0.292 & 0.292 & 0.537 & 0.537 & 0.164 & 0.164\\ 

DiffMAGCumPapersAtRetraction
& 23.28 & 28.55 & 29.47 & 32.91 & 15.35 & 22.6\\ 

& 10.95 & 17.21 & 13.36 & 22.07 & 9.73 & 11.3\\ 

& 64.26 & 80.73 & 90.05 & 66.44 & 40.7 & 54.4\\ 

& 0.291 & 0.291 & 0.606 & 0.606 & 0.033 & 0.033\\ 

DiffMAGCumCitationsAtRetraction
& 301.29 & 449.46 & 230.42 & 552.37 & 191.99 & 274.36\\ 

& 52.54 & 145.96 & 32.45 & 135.07 & -0.25 & -8.65\\ 

& 2005.05 & 1842.0 & 1577.27 & 2083.55 & 2289.88 & 1703.39\\ 

& 0.261 & 0.261 & 0.039 & 0.039 & 0.563 & 0.563\\ 

DiffMAGCumCollaboratorsAtRetraction
& 37.81 & 51.86 & 57.14 & 62.8 & 31.26 & 34.17\\ 

& 17.58 & 32.33 & 16.11 & 35.94 & 9.07 & 18.02\\ 

& 198.99 & 344.57 & 240.47 & 162.51 & 205.6 & 140.63\\ 

& 0.465 & 0.465 & 0.743 & 0.743 & 0.815 & 0.815\\ 
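
The row layout printed above (retracted and matched values for Junior, Mid and Senior, with the p-value repeated in both columns of each group) could also be produced by a more compact helper. The function `latex_row` and the toy dictionary below are hypothetical and only sketch the idea; they are not part of the notebook.

```python
def latex_row(stats_by_group, col, metric, groups=("Junior", "Mid", "Senior")):
    """Build one LaTeX table row (hypothetical helper, for illustration)."""
    cells = []
    for group in groups:
        group_stats = stats_by_group[group]
        if metric == "pval_welch":
            # One p-value per group, repeated to fill both columns
            value = group_stats[f"{col}_{metric}"]
            cells += [value, value]
        else:
            cells += [group_stats[f"{col}_retracted_{metric}"],
                      group_stats[f"{col}_nonretracted_{metric}"]]
    # A raw string keeps the LaTeX row terminator as two literal backslashes
    return "& " + " & ".join(str(c) for c in cells) + r"\\"


# Toy input with the same key structure as the dictionaries from get_stats
toy = {g: {"X_retracted_mean": 1.0, "X_nonretracted_mean": 2.0, "X_pval_welch": 0.5}
       for g in ("Junior", "Mid", "Senior")}

row_mean = latex_row(toy, "X", "mean")
row_p = latex_row(toy, "X", "pval_welch")
print(row_mean)
print(row_p)
```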



In [76]:
dicto_did

{'Junior': {'DiffMAGCumCollaboratorsAtRetraction_retracted_mean': 37.81,
  'DiffMAGCumCollaboratorsAtRetraction_retracted_median': 17.58,
  'DiffMAGCumCollaboratorsAtRetraction_retracted_std': 198.99,
  'DiffMAGCumCollaboratorsAtRetraction_nonretracted_mean': 51.86,
  'DiffMAGCumCollaboratorsAtRetraction_nonretracted_median': 32.33,
  'DiffMAGCumCollaboratorsAtRetraction_nonretracted_std': 344.57,
  'DiffMAGCumCollaboratorsAtRetraction_delta_mean': -14.05,
  'DiffMAGCumCollaboratorsAtRetraction_pval_welch': 0.465,
  'DiffMAGCumCollaboratorsAtRetraction_CI_95lower': -52.68,
  'DiffMAGCumCollaboratorsAtRetraction_CI_95upper': 24.58},
 'Mid': {'DiffMAGCumCollaboratorsAtRetraction_retracted_mean': 57.14,
  'DiffMAGCumCollaboratorsAtRetraction_retracted_median': 16.11,
  'DiffMAGCumCollaboratorsAtRetraction_retracted_std': 240.47,
  'DiffMAGCumCollaboratorsAtRetraction_nonretracted_mean': 62.8,
  'DiffMAGCumCollaboratorsAtRetraction_nonretracted_median': 35.94,
  'DiffMAGCumCollaboratorsAtR

# Preprocessing dictionaries for plots

In [77]:
expfield_categories = ['Academic Age','Number of Papers',
                       'Number of Citations', 'Number of Collaborators']

master_dict = {}

master_dict['Retention'] = {}

for i in range(len(expfield_categories)):
    master_dict['Retention'][expfield_categories[i]] = lst_dicts_retention[i]

master_dict['Gain'] = {}

for i in range(len(expfield_categories)):
    master_dict['Gain'][expfield_categories[i]] = lst_dicts_gain[i]
    
master_dict['DiD'] = {}

for i in range(len(expfield_categories)):
    master_dict['DiD'][expfield_categories[i]] = lst_dicts_a3[i]

In [78]:
master_dict.keys()

dict_keys(['Retention', 'Gain', 'DiD'])

In [79]:
master_dict

{'Retention': {'Academic Age': {'Junior': {'CollabAcademicAgeAtCollaboration_retracted_mean': 12.26,
    'CollabAcademicAgeAtCollaboration_retracted_median': 11.33,
    'CollabAcademicAgeAtCollaboration_retracted_std': 8.55,
    'CollabAcademicAgeAtCollaboration_nonretracted_mean': 14.01,
    'CollabAcademicAgeAtCollaboration_nonretracted_median': 13.12,
    'CollabAcademicAgeAtCollaboration_nonretracted_std': 6.91,
    'CollabAcademicAgeAtCollaboration_delta_mean': -1.75,
    'CollabAcademicAgeAtCollaboration_pval_welch': 0.001,
    'CollabAcademicAgeAtCollaboration_CI_95lower': -2.68,
    'CollabAcademicAgeAtCollaboration_CI_95upper': -0.82},
   'Mid': {'CollabAcademicAgeAtCollaboration_retracted_mean': 14.69,
    'CollabAcademicAgeAtCollaboration_retracted_median': 14.62,
    'CollabAcademicAgeAtCollaboration_retracted_std': 6.81,
    'CollabAcademicAgeAtCollaboration_nonretracted_mean': 15.73,
    'CollabAcademicAgeAtCollaboration_nonretracted_median': 15.24,
    'CollabAcademicAge

In [80]:
pd.DataFrame.from_dict(master_dict)

Unnamed: 0,Retention,Gain,DiD
Academic Age,{'Junior': {'CollabAcademicAgeAtCollaboration_...,{'Junior': {'CollabAcademicAgeAtCollaboration_...,{'Junior': {'DiffAcademicAgeAtRetraction_retra...
Number of Papers,{'Junior': {'CollabMAGCumPapersAtCollaboration...,{'Junior': {'CollabMAGCumPapersAtCollaboration...,{'Junior': {'DiffMAGCumPapersAtRetraction_retr...
Number of Citations,{'Junior': {'CollabMAGCumCitationsAtCollaborat...,{'Junior': {'CollabMAGCumCitationsAtCollaborat...,{'Junior': {'DiffMAGCumCitationsAtRetraction_r...
Number of Collaborators,{'Junior': {'CollabMAGCumCollaboratorsAtCollab...,{'Junior': {'CollabMAGCumCollaboratorsAtCollab...,{'Junior': {'DiffMAGCumCollaboratorsAtRetracti...


In [81]:
import pickle

def save_dict(dicto, fname):
    # Serialize the dictionary to disk with pickle
    with open(fname, 'wb') as f:
        pickle.dump(dicto, f)

def read_dict(fname):
    # Load a dictionary previously written by save_dict
    with open(fname, 'rb') as f:
        return pickle.load(f)

In [82]:
save_dict(master_dict, "collaborator_chars_byAge.pkl")

In [83]:
dict_temp = read_dict("collaborator_chars_byAge.pkl")
dict_temp

{'Retention': {'Academic Age': {'Junior': {'CollabAcademicAgeAtCollaboration_retracted_mean': 12.26,
    'CollabAcademicAgeAtCollaboration_retracted_median': 11.33,
    'CollabAcademicAgeAtCollaboration_retracted_std': 8.55,
    'CollabAcademicAgeAtCollaboration_nonretracted_mean': 14.01,
    'CollabAcademicAgeAtCollaboration_nonretracted_median': 13.12,
    'CollabAcademicAgeAtCollaboration_nonretracted_std': 6.91,
    'CollabAcademicAgeAtCollaboration_delta_mean': -1.75,
    'CollabAcademicAgeAtCollaboration_pval_welch': 0.001,
    'CollabAcademicAgeAtCollaboration_CI_95lower': -2.68,
    'CollabAcademicAgeAtCollaboration_CI_95upper': -0.82},
   'Mid': {'CollabAcademicAgeAtCollaboration_retracted_mean': 14.69,
    'CollabAcademicAgeAtCollaboration_retracted_median': 14.62,
    'CollabAcademicAgeAtCollaboration_retracted_std': 6.81,
    'CollabAcademicAgeAtCollaboration_nonretracted_mean': 15.73,
    'CollabAcademicAgeAtCollaboration_nonretracted_median': 15.24,
    'CollabAcademicAge