# Characterizing Collaborators

In this notebook, we characterize collaborators as follows.

There are three academic age groups: junior (0-3), mid (4-9), and senior (10+).

For each academic age group within retracted and matched scientists **at the time of retraction**, we shall conduct three analyses and create three tables:

#### Retained for retracted vs. matched
1. Table 1 compares the **retained** collaborators of retracted and matched scientists in terms of their (a) mean academic age, (b) average number of papers, (c) average number of citations, and (d) average number of collaborators, all **at the time of collaboration**. The table also reports the median, the standard deviation, and the p-value of a t-test.

#### Gained for retracted vs. matched
2. Table 2 compares the **gained/new** collaborators of retracted and matched scientists in terms of their (a) mean academic age, (b) average number of papers, (c) average number of citations, and (d) average number of collaborators, all **at the time of collaboration**. The table also reports the median, the standard deviation, and the p-value of a t-test.

#### Retained vs. lost for retracted vs. matched
3. Table 3 compares the **retained** collaborators of retracted and matched scientists to those **lost** in terms of their (a) mean academic age, (b) average number of papers, (c) average number of citations, and (d) average number of collaborators, all **at the time of retraction**. The table is produced by a difference-in-differences (DiD) approach. We first compute the average of each measure (papers, citations, etc.) for retained and lost collaborators of retracted and matched scientists. We then compute the difference between retracted and matched scientists for retained collaborators, and likewise for lost collaborators. Finally, we take the difference of these two differences, i.e. **RETAINED-LOST**. The table also reports the median, the standard deviation, and the p-value of a t-test.
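The difference-in-differences step in item 3 can be sketched on toy numbers. All values below are made up for illustration, and the sign convention of the first difference (retracted minus matched) is an assumption; the text above only fixes the final RETAINED-LOST order.

```python
import pandas as pd

# Toy per-group averages of one measure (e.g. papers); all numbers are made up.
toy = pd.DataFrame({
    "ScientistType": ["retracted", "retracted", "matched", "matched"],
    "CollabStatus": ["retained", "lost", "retained", "lost"],
    "MeanPapers": [40.0, 25.0, 35.0, 30.0],
})

# Rows: collaborator status; columns: scientist type.
means = toy.pivot(index="CollabStatus", columns="ScientistType", values="MeanPapers")

# First difference (assumed order): retracted minus matched, per collaborator status.
first_diff = means["retracted"] - means["matched"]

# Second difference: RETAINED minus LOST.
did = first_diff["retained"] - first_diff["lost"]
print(did)
```

With these toy values the retained gap is 5 and the lost gap is -5, so the DiD is 10. Flipping the assumed sign of the first difference would flip the sign of the result.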



In [84]:
import pandas as pd
import sys
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

In [85]:
INDIR = "/Users/sm9654/desktop/NYUAD/nyuad-research/retraction_openalex/retraction_effects_on_academic_careers/data/processed/"
INDIR_MATCHING = INDIR+"/author_matching/"
INDIR_COLLAB = INDIR+"/collaborator_quality_analysis/"

df = pd.read_csv(INDIR_COLLAB+"/1Dcollaborators_for_matched_sample_30.csv")

print(df.shape)

df.head()

(773033, 20)


Unnamed: 0,MAGAID,MAGCollabAID,RetractionYear,MAGCollaborationYear,ScientistType,CollabGenderizeGender,CollabGenderizeConfidence,CollabMAGFirstPubYear,CollabMAGCumPapersYearAtRetraction,CollabMAGCumPapersAtRetraction,CollabMAGCumCitationsYearAtRetraction,CollabMAGCumCitationsAtRetraction,CollabMAGCumCollaboratorsYearAtRetraction,CollabMAGCumCollaboratorsAtRetraction,CollabMAGCumPapersYearAtCollaboration,CollabMAGCumPapersAtCollaboration,CollabMAGCumCitationsYearAtCollaboration,CollabMAGCumCitationsAtCollaboration,CollabMAGCumCollaboratorsYearAtCollaboration,CollabMAGCumCollaboratorsAtCollaboration
0,2105038000.0,2004120834,1994.0,1983.0,retracted,male,0.99,1967.0,1994.0,104.0,1994.0,3683.0,1994.0,83.0,1983.0,47.0,1983.0,1076.0,1983.0,46
1,2105038000.0,2124401064,1994.0,1983.0,retracted,male,0.74,1964.0,1994.0,78.0,1994.0,2668.0,1994.0,81.0,1983.0,47.0,1983.0,1305.0,1983.0,37
2,2105038000.0,2486043001,1994.0,1983.0,retracted,male,0.6,1971.0,1983.0,10.0,1994.0,532.0,1983.0,18.0,1983.0,10.0,1983.0,199.0,1983.0,18
3,2105038000.0,2124401064,1994.0,1992.0,retracted,male,0.74,1964.0,1994.0,78.0,1994.0,2668.0,1994.0,81.0,1992.0,74.0,1992.0,2449.0,1992.0,71
4,2105038000.0,2276877851,1994.0,1992.0,retracted,female,0.98,1984.0,1993.0,16.0,1994.0,136.0,1993.0,31.0,1992.0,14.0,1992.0,90.0,1992.0,27


In [86]:
print(df.shape)

(773033, 20)


In [87]:
df.MAGCollabAID.nunique()

411911

In [88]:
df_treatment = pd.read_csv(INDIR_MATCHING+"/RWMAG_rematched_treatment_augmented_rematching_30perc.csv")
df_treatment['MAGAIDFirstORLastAuthorFlag']

0            MAGMiddleAuthor
1            MAGMiddleAuthor
2            MAGMiddleAuthor
3            MAGMiddleAuthor
4            MAGMiddleAuthor
                ...         
2803    MAGFirstOrLastAuthor
2804         MAGMiddleAuthor
2805         MAGMiddleAuthor
2806    MAGFirstOrLastAuthor
2807    MAGFirstOrLastAuthor
Name: MAGAIDFirstORLastAuthorFlag, Length: 2808, dtype: object

### Preprocessing

In [89]:
# Let us first augment the academic age of MAGAIDs. We will also add other columns to be used later

# Reading files used for matching

df_treatment = pd.read_csv(INDIR_MATCHING+"/RWMAG_rematched_treatment_augmented_rematching_30perc.csv",
                    usecols=['MAGAID','RetractionYear','NumRetentionW5','NumNewCollaboratorsW5','AcademicAgeBeforeRetraction',
                            'MAGAIDFirstORLastAuthorFlag'])\
                    .drop_duplicates()\
                    .rename(columns={'AcademicAgeBeforeRetraction': 'AcademicAgeAtRetraction'})

df_control = pd.read_csv(INDIR_MATCHING+"/RWMAG_rematched_control_augmented_rematching_30perc.csv",
                    usecols=['MatchMAGAID','RetractionYear','NumRetentionW5','NumNewCollaboratorsW5','AcademicAgeBeforeRetraction',
                            'MAGAIDFirstORLastAuthorFlag'])\
                    .drop_duplicates()\
                    .rename(columns={'MatchMAGAID':'MAGAID',
                                    'AcademicAgeBeforeRetraction': 'AcademicAgeAtRetraction'})

# Filtering process for choosing only first and last authors

df_treatment = df_treatment[df_treatment['MAGAIDFirstORLastAuthorFlag']=='MAGFirstOrLastAuthor']

df_control = df_control[df_control['MAGAIDFirstORLastAuthorFlag']=='MAGFirstOrLastAuthor']


df_treatment_control = pd.concat([df_treatment,df_control])

# Keeping only the collaborator rows whose focal author is in the filtered treatment/control sample
df = df[df['MAGAID'].isin(df_treatment_control['MAGAID'])]

# Categorize academic age into the three bins defined above: 0-3, 4-9, and 10+

def categorize_age(age):
    # Returns None when age is missing (NaN), since every comparison is then False
    if age <= 3:
        return 'early-career author'
    elif (age > 3) and (age < 10):
        return 'mid-career author'
    elif age >= 10:
        return 'senior author'

df_treatment_control['AuthorSeniorityAtRetraction'] = df_treatment_control['AcademicAgeAtRetraction']\
                                                        .apply(categorize_age)

df_treatment_control

Unnamed: 0,MAGAID,RetractionYear,MAGAIDFirstORLastAuthorFlag,AcademicAgeAtRetraction,NumRetentionW5,NumNewCollaboratorsW5,AuthorSeniorityAtRetraction
20,1.839367e+08,2007.0,MAGFirstOrLastAuthor,2.0,8,79,early-career author
22,2.004364e+08,2012.0,MAGFirstOrLastAuthor,6.0,13,18,mid-career author
23,2.066031e+08,2012.0,MAGFirstOrLastAuthor,1.0,12,28,early-career author
25,2.072804e+08,2012.0,MAGFirstOrLastAuthor,23.0,20,88,senior author
26,2.074934e+08,2008.0,MAGFirstOrLastAuthor,22.0,13,55,senior author
...,...,...,...,...,...,...,...
5411,2.127710e+09,2015.0,MAGFirstOrLastAuthor,9.0,3,25,mid-career author
5412,1.974243e+09,2013.0,MAGFirstOrLastAuthor,6.0,3,2,mid-career author
5413,2.100696e+09,2014.0,MAGFirstOrLastAuthor,37.0,15,89,senior author
5416,2.077873e+09,2008.0,MAGFirstOrLastAuthor,3.0,2,3,early-career author
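The same binning can be expressed without a row-wise `apply`. The following is a minimal sketch using `pd.cut` on hypothetical ages, not the notebook's own code. It assumes academic ages are whole years: a fractional age such as 9.5 would be labelled senior here but mid-career by `categorize_age`.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; NaN stands for an author with unknown academic age.
ages = pd.Series([0, 3, 4, 9, 10, 37, np.nan])

seniority = pd.cut(
    ages,
    bins=[-np.inf, 3, 9, np.inf],  # intervals (-inf, 3], (3, 9], (9, inf)
    labels=["early-career author", "mid-career author", "senior author"],
)
print(seniority.tolist())
```

Missing ages stay missing in the result, which mirrors the implicit `None` that `categorize_age` returns for NaN input.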


In [90]:
# Merging that with df

df2 = df.merge(df_treatment_control.drop(columns=['NumRetentionW5','NumNewCollaboratorsW5']), 
                                         on=['MAGAID','RetractionYear'])
df2

Unnamed: 0,MAGAID,MAGCollabAID,RetractionYear,MAGCollaborationYear,ScientistType,CollabGenderizeGender,CollabGenderizeConfidence,CollabMAGFirstPubYear,CollabMAGCumPapersYearAtRetraction,CollabMAGCumPapersAtRetraction,...,CollabMAGCumCollaboratorsAtRetraction,CollabMAGCumPapersYearAtCollaboration,CollabMAGCumPapersAtCollaboration,CollabMAGCumCitationsYearAtCollaboration,CollabMAGCumCitationsAtCollaboration,CollabMAGCumCollaboratorsYearAtCollaboration,CollabMAGCumCollaboratorsAtCollaboration,MAGAIDFirstORLastAuthorFlag,AcademicAgeAtRetraction,AuthorSeniorityAtRetraction
0,2.033335e+09,1917877966,1995.0,1999.0,retracted,male,0.99,1994.0,1994.0,1.0,...,2.0,1999.0,14.0,1999.0,99.0,1999.0,31,MAGFirstOrLastAuthor,28.0,senior author
1,2.033335e+09,2169118091,1995.0,1999.0,retracted,male,1.00,1995.0,1995.0,1.0,...,3.0,1999.0,9.0,1999.0,8.0,1999.0,23,MAGFirstOrLastAuthor,28.0,senior author
2,2.033335e+09,275085591,1995.0,1997.0,retracted,male,0.99,1994.0,1995.0,5.0,...,11.0,1997.0,12.0,1997.0,53.0,1997.0,22,MAGFirstOrLastAuthor,28.0,senior author
3,2.033335e+09,2111014462,1995.0,1997.0,retracted,female,0.98,1988.0,1995.0,17.0,...,30.0,1997.0,22.0,1997.0,921.0,1997.0,42,MAGFirstOrLastAuthor,28.0,senior author
4,2.033335e+09,2622920657,1995.0,1997.0,retracted,male,0.99,1991.0,1995.0,17.0,...,33.0,1997.0,26.0,1997.0,306.0,1997.0,44,MAGFirstOrLastAuthor,28.0,senior author
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
249984,2.147516e+09,2144146330,2009.0,2016.0,matched,male,0.81,1982.0,2009.0,284.0,...,464.0,2016.0,563.0,2016.0,16846.0,2016.0,3686,MAGFirstOrLastAuthor,4.0,mid-career author
249985,2.147516e+09,3146371485,2009.0,2015.0,matched,male,0.60,2015.0,,0.0,...,0.0,2015.0,1.0,2015.0,1.0,2015.0,4,MAGFirstOrLastAuthor,4.0,mid-career author
249986,2.147516e+09,3149630639,2009.0,2014.0,matched,male,0.97,2014.0,,0.0,...,0.0,2014.0,1.0,,0.0,2014.0,4,MAGFirstOrLastAuthor,4.0,mid-career author
249987,2.147516e+09,3165764306,2009.0,2014.0,matched,male,0.81,2014.0,,0.0,...,0.0,2014.0,1.0,,0.0,2014.0,4,MAGFirstOrLastAuthor,4.0,mid-career author


In [91]:
# Let us first compute academic age at retraction and at collaboration for collaborators
df2['CollabAcademicAgeAtRetraction'] = df2['RetractionYear']-df2['CollabMAGFirstPubYear']

df2['CollabAcademicAgeAtCollaboration'] = df2['MAGCollaborationYear']-df2['CollabMAGFirstPubYear']

# A collaborator who first published after the retraction year has a negative academic age at retraction;
# the academic age at collaboration cannot be negative
df2.CollabAcademicAgeAtRetraction.describe()

count    249989.000000
mean          9.189364
std          13.634279
min         -30.000000
25%           0.000000
50%           7.000000
75%          17.000000
max         215.000000
Name: CollabAcademicAgeAtRetraction, dtype: float64

In [92]:
df2.CollabAcademicAgeAtCollaboration.describe()

count    249989.000000
mean         10.315014
std          12.304358
min           0.000000
25%           1.000000
50%           6.000000
75%          16.000000
max         220.000000
Name: CollabAcademicAgeAtCollaboration, dtype: float64

In [93]:
# Let us first identify if the collaboration was pre- or post-retraction

def get_prepost_flag(row):
    if(row['MAGCollaborationYear'] <= row['RetractionYear']):
        return 'pre'
    else:
        if((row['MAGCollaborationYear']-row['RetractionYear'])<=5):
            return 'post5'
        else:
            return 'post'

df2['PrePostFlag5'] = df2.apply(lambda row: get_prepost_flag(row), axis=1)
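On a frame of this size, the row-wise `apply` can be slow. A vectorized version of the same rule might look like the sketch below, which runs on a made-up four-row frame; `toy` and `gap` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up collaborations around a retraction in 2000.
toy = pd.DataFrame({
    "RetractionYear": [2000.0, 2000.0, 2000.0, 2000.0],
    "MAGCollaborationYear": [1998.0, 2000.0, 2005.0, 2006.0],
})

# Years elapsed between the retraction and the collaboration.
gap = toy["MAGCollaborationYear"] - toy["RetractionYear"]

# Conditions are checked in order: 'pre' first, then 'post5', else 'post'.
toy["PrePostFlag5"] = np.select(
    [gap <= 0, gap <= 5],
    ["pre", "post5"],
    default="post",
)
print(toy["PrePostFlag5"].tolist())
```

As in `get_prepost_flag`, a collaboration in the retraction year counts as `pre`, and rows with a missing year fall through to `post` because comparisons with NaN are False.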

In [94]:
# Let us remove the collaborations that happened more than 5 years after the retraction ("post")

df3 = df2[~df2.PrePostFlag5.eq('post')]


In [95]:
df3.head()

Unnamed: 0,MAGAID,MAGCollabAID,RetractionYear,MAGCollaborationYear,ScientistType,CollabGenderizeGender,CollabGenderizeConfidence,CollabMAGFirstPubYear,CollabMAGCumPapersYearAtRetraction,CollabMAGCumPapersAtRetraction,...,CollabMAGCumCitationsYearAtCollaboration,CollabMAGCumCitationsAtCollaboration,CollabMAGCumCollaboratorsYearAtCollaboration,CollabMAGCumCollaboratorsAtCollaboration,MAGAIDFirstORLastAuthorFlag,AcademicAgeAtRetraction,AuthorSeniorityAtRetraction,CollabAcademicAgeAtRetraction,CollabAcademicAgeAtCollaboration,PrePostFlag5
0,2033335000.0,1917877966,1995.0,1999.0,retracted,male,0.99,1994.0,1994.0,1.0,...,1999.0,99.0,1999.0,31,MAGFirstOrLastAuthor,28.0,senior author,1.0,5.0,post5
1,2033335000.0,2169118091,1995.0,1999.0,retracted,male,1.0,1995.0,1995.0,1.0,...,1999.0,8.0,1999.0,23,MAGFirstOrLastAuthor,28.0,senior author,0.0,4.0,post5
2,2033335000.0,275085591,1995.0,1997.0,retracted,male,0.99,1994.0,1995.0,5.0,...,1997.0,53.0,1997.0,22,MAGFirstOrLastAuthor,28.0,senior author,1.0,3.0,post5
3,2033335000.0,2111014462,1995.0,1997.0,retracted,female,0.98,1988.0,1995.0,17.0,...,1997.0,921.0,1997.0,42,MAGFirstOrLastAuthor,28.0,senior author,7.0,9.0,post5
4,2033335000.0,2622920657,1995.0,1997.0,retracted,male,0.99,1991.0,1995.0,17.0,...,1997.0,306.0,1997.0,44,MAGFirstOrLastAuthor,28.0,senior author,4.0,6.0,post5


In [96]:
df3.columns

Index(['MAGAID', 'MAGCollabAID', 'RetractionYear', 'MAGCollaborationYear',
       'ScientistType', 'CollabGenderizeGender', 'CollabGenderizeConfidence',
       'CollabMAGFirstPubYear', 'CollabMAGCumPapersYearAtRetraction',
       'CollabMAGCumPapersAtRetraction',
       'CollabMAGCumCitationsYearAtRetraction',
       'CollabMAGCumCitationsAtRetraction',
       'CollabMAGCumCollaboratorsYearAtRetraction',
       'CollabMAGCumCollaboratorsAtRetraction',
       'CollabMAGCumPapersYearAtCollaboration',
       'CollabMAGCumPapersAtCollaboration',
       'CollabMAGCumCitationsYearAtCollaboration',
       'CollabMAGCumCitationsAtCollaboration',
       'CollabMAGCumCollaboratorsYearAtCollaboration',
       'CollabMAGCumCollaboratorsAtCollaboration',
       'MAGAIDFirstORLastAuthorFlag', 'AcademicAgeAtRetraction',
       'AuthorSeniorityAtRetraction', 'CollabAcademicAgeAtRetraction',
       'CollabAcademicAgeAtCollaboration', 'PrePostFlag5'],
      dtype='object')

In [97]:
# For each MAGAID, let us create a set of collaborators pre- and post- retraction

df4 = df3.groupby(['MAGAID','RetractionYear','PrePostFlag5'])\
                        ['MAGCollabAID'].apply(set).unstack().reset_index()


# Replacing missing pre/post5 entries (NaN after unstack) with empty sets so we can do set operations
df4['pre'] = df4['pre'].apply(lambda d: d if isinstance(d, set) else set())
df4['post5'] = df4['post5'].apply(lambda d: d if isinstance(d, set) else set())


# COLLABORATOR RETENTION

# Computing number of collaborators retained
df4['NumRetentionW5'] = df4.apply(lambda row: len(row.post5.intersection(row.pre)), 
                            axis=1)

# Creating the set of collaborators retained
df4['CollabAIDRetainedW5'] = df4.apply(lambda row: row.post5.intersection(row.pre), 
                                                    axis=1)


# Creating the set of collaborators lost
df4['CollabAIDLostW5'] = df4.apply(lambda row: row['pre'] - row['CollabAIDRetainedW5'], 
                                                    axis=1)


# COLLABORATOR GAIN

# Computing number of collaborators gained
df4['NumNewCollaboratorsW5'] = df4.apply(lambda row: len(row['post5']-row['pre']), 
                                                    axis=1)

# Creating set of collaborators gained
df4['CollabAIDGainedW5'] = df4.apply(lambda row: row['post5']-row['pre'], 
                                                    axis=1)


df4.head()

PrePostFlag5,MAGAID,RetractionYear,post5,pre,NumRetentionW5,CollabAIDRetainedW5,CollabAIDLostW5,NumNewCollaboratorsW5,CollabAIDGainedW5
0,19100288.0,2002.0,"{18011520, 2181345027, 2608247304, 2043645593,...","{2437904394, 2171993227, 2043645593, 220887567...",5,"{2128982626, 1863203661, 2043645593, 410625722...","{2059887737, 2186312265, 2437904394, 217199322...",23,"{18011520, 2181345027, 2608247304, 2111905563,..."
1,21686935.0,2008.0,"{2646743714, 2800470565, 2117660071, 269971592...","{2170694433, 2064283617, 2327654243, 247583094...",0,{},"{2330648593, 2965357269, 3081263517, 250572015...",12,"{2646743714, 2800470565, 2117660071, 269971592..."
2,29680017.0,1997.0,"{2309217606, 2070053831, 1588714087, 252097620...","{287126053, 1971237384, 2023717609, 2273398858...",1,{2920459384},"{1971237384, 2273398858, 1998215821, 230597927...",4,"{2520976201, 1588714087, 2309217606, 2070053831}"
3,33433812.0,2009.0,"{2653850944, 2127386819, 2128942892, 2113925965}","{2653850944, 2127386819, 2572203943, 212894289...",4,"{2653850944, 2127386819, 2128942892, 2113925965}","{2137538457, 2149027090, 2572203943}",0,{}
4,41957466.0,2015.0,"{2034626565, 1848517639, 2597654027, 203201076...","{2152585733, 2142553607, 2110808072, 294896999...",32,"{2047324288, 2133711754, 1719173006, 219254094...","{2152585733, 2142553607, 2110808072, 294896999...",83,"{2034626565, 1848517639, 2597654027, 203201076..."
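The three categories follow directly from set algebra on the `pre` and `post5` sets. A minimal worked example with made-up collaborator IDs:

```python
# Made-up collaborator IDs for a single author.
pre = {1, 2, 3, 4}      # collaborators up to and including the retraction year
post5 = {3, 4, 5}       # collaborators within 5 years after the retraction

retained = post5 & pre  # seen both before and after
lost = pre - retained   # seen before but not after
gained = post5 - pre    # seen only after

print(retained, lost, gained)
```

Retained and lost collaborators together make up `pre`, and retained and gained together make up `post5`, so each collaborator of an author falls into exactly one of the three sets.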


### Validation of the number of collaborators retained and gained 

We shall check whether the counts computed here match the ones that were used for matching.

In [98]:
# Merging
dfvalidation = df4[['MAGAID','RetractionYear','NumRetentionW5','NumNewCollaboratorsW5']].drop_duplicates().\
                    merge(df_treatment_control, on=['MAGAID','RetractionYear'])

dfvalidation

Unnamed: 0,MAGAID,RetractionYear,NumRetentionW5_x,NumNewCollaboratorsW5_x,MAGAIDFirstORLastAuthorFlag,AcademicAgeAtRetraction,NumRetentionW5_y,NumNewCollaboratorsW5_y,AuthorSeniorityAtRetraction
0,1.910029e+07,2002.0,5,23,MAGFirstOrLastAuthor,9.0,5,23,mid-career author
1,2.168694e+07,2008.0,0,12,MAGFirstOrLastAuthor,3.0,0,12,early-career author
2,2.968002e+07,1997.0,1,4,MAGFirstOrLastAuthor,5.0,1,4,mid-career author
3,3.343381e+07,2009.0,4,0,MAGFirstOrLastAuthor,2.0,4,0,early-career author
4,4.195747e+07,2015.0,32,83,MAGFirstOrLastAuthor,22.0,32,83,senior author
...,...,...,...,...,...,...,...,...,...
1757,3.173544e+09,2011.0,0,2,MAGFirstOrLastAuthor,6.0,0,2,mid-career author
1758,3.174124e+09,2004.0,1,6,MAGFirstOrLastAuthor,1.0,1,6,early-career author
1759,3.174448e+09,2008.0,1,0,MAGFirstOrLastAuthor,2.0,1,0,early-career author
1760,3.175436e+09,2015.0,1,14,MAGFirstOrLastAuthor,6.0,1,14,mid-career author


In [99]:
# Finally validating

dfvalidation[(dfvalidation.NumRetentionW5_x == dfvalidation.NumRetentionW5_y) & 
            (dfvalidation.NumNewCollaboratorsW5_x == dfvalidation.NumNewCollaboratorsW5_y)]

Unnamed: 0,MAGAID,RetractionYear,NumRetentionW5_x,NumNewCollaboratorsW5_x,MAGAIDFirstORLastAuthorFlag,AcademicAgeAtRetraction,NumRetentionW5_y,NumNewCollaboratorsW5_y,AuthorSeniorityAtRetraction
0,1.910029e+07,2002.0,5,23,MAGFirstOrLastAuthor,9.0,5,23,mid-career author
1,2.168694e+07,2008.0,0,12,MAGFirstOrLastAuthor,3.0,0,12,early-career author
2,2.968002e+07,1997.0,1,4,MAGFirstOrLastAuthor,5.0,1,4,mid-career author
3,3.343381e+07,2009.0,4,0,MAGFirstOrLastAuthor,2.0,4,0,early-career author
4,4.195747e+07,2015.0,32,83,MAGFirstOrLastAuthor,22.0,32,83,senior author
...,...,...,...,...,...,...,...,...,...
1757,3.173544e+09,2011.0,0,2,MAGFirstOrLastAuthor,6.0,0,2,mid-career author
1758,3.174124e+09,2004.0,1,6,MAGFirstOrLastAuthor,1.0,1,6,early-career author
1759,3.174448e+09,2008.0,1,0,MAGFirstOrLastAuthor,2.0,1,0,early-career author
1760,3.175436e+09,2015.0,1,14,MAGFirstOrLastAuthor,6.0,1,14,mid-career author


**Hence all of them are validated.**
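Comparing two truncated table displays by eye is error-prone, so it can help to count mismatching rows explicitly. The sketch below does this on a toy frame with the same column names as `dfvalidation`; the values are copied from the first three rows shown above, and `toy_validation` is an illustrative name.

```python
import pandas as pd

# Toy stand-in for dfvalidation (_x: recomputed here, _y: used for matching).
toy_validation = pd.DataFrame({
    "NumRetentionW5_x": [5, 0, 1],
    "NumRetentionW5_y": [5, 0, 1],
    "NumNewCollaboratorsW5_x": [23, 12, 4],
    "NumNewCollaboratorsW5_y": [23, 12, 4],
})

# A row matches only if both counts agree.
matches = (
    toy_validation["NumRetentionW5_x"].eq(toy_validation["NumRetentionW5_y"])
    & toy_validation["NumNewCollaboratorsW5_x"].eq(toy_validation["NumNewCollaboratorsW5_y"])
)
n_mismatch = int((~matches).sum())
print(n_mismatch)
```

A mismatch count of zero on the real frame would confirm the validation without relying on the displayed rows.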

## Analysis

In [100]:
# Our main dataframes are df3 and df4
# Let us look at them first
print(df3.shape)
df3.head()

(184830, 26)


Unnamed: 0,MAGAID,MAGCollabAID,RetractionYear,MAGCollaborationYear,ScientistType,CollabGenderizeGender,CollabGenderizeConfidence,CollabMAGFirstPubYear,CollabMAGCumPapersYearAtRetraction,CollabMAGCumPapersAtRetraction,...,CollabMAGCumCitationsYearAtCollaboration,CollabMAGCumCitationsAtCollaboration,CollabMAGCumCollaboratorsYearAtCollaboration,CollabMAGCumCollaboratorsAtCollaboration,MAGAIDFirstORLastAuthorFlag,AcademicAgeAtRetraction,AuthorSeniorityAtRetraction,CollabAcademicAgeAtRetraction,CollabAcademicAgeAtCollaboration,PrePostFlag5
0,2033335000.0,1917877966,1995.0,1999.0,retracted,male,0.99,1994.0,1994.0,1.0,...,1999.0,99.0,1999.0,31,MAGFirstOrLastAuthor,28.0,senior author,1.0,5.0,post5
1,2033335000.0,2169118091,1995.0,1999.0,retracted,male,1.0,1995.0,1995.0,1.0,...,1999.0,8.0,1999.0,23,MAGFirstOrLastAuthor,28.0,senior author,0.0,4.0,post5
2,2033335000.0,275085591,1995.0,1997.0,retracted,male,0.99,1994.0,1995.0,5.0,...,1997.0,53.0,1997.0,22,MAGFirstOrLastAuthor,28.0,senior author,1.0,3.0,post5
3,2033335000.0,2111014462,1995.0,1997.0,retracted,female,0.98,1988.0,1995.0,17.0,...,1997.0,921.0,1997.0,42,MAGFirstOrLastAuthor,28.0,senior author,7.0,9.0,post5
4,2033335000.0,2622920657,1995.0,1997.0,retracted,male,0.99,1991.0,1995.0,17.0,...,1997.0,306.0,1997.0,44,MAGFirstOrLastAuthor,28.0,senior author,4.0,6.0,post5


In [101]:
df4

PrePostFlag5,MAGAID,RetractionYear,post5,pre,NumRetentionW5,CollabAIDRetainedW5,CollabAIDLostW5,NumNewCollaboratorsW5,CollabAIDGainedW5
0,1.910029e+07,2002.0,"{18011520, 2181345027, 2608247304, 2043645593,...","{2437904394, 2171993227, 2043645593, 220887567...",5,"{2128982626, 1863203661, 2043645593, 410625722...","{2059887737, 2186312265, 2437904394, 217199322...",23,"{18011520, 2181345027, 2608247304, 2111905563,..."
1,2.168694e+07,2008.0,"{2646743714, 2800470565, 2117660071, 269971592...","{2170694433, 2064283617, 2327654243, 247583094...",0,{},"{2330648593, 2965357269, 3081263517, 250572015...",12,"{2646743714, 2800470565, 2117660071, 269971592..."
2,2.968002e+07,1997.0,"{2309217606, 2070053831, 1588714087, 252097620...","{287126053, 1971237384, 2023717609, 2273398858...",1,{2920459384},"{1971237384, 2273398858, 1998215821, 230597927...",4,"{2520976201, 1588714087, 2309217606, 2070053831}"
3,3.343381e+07,2009.0,"{2653850944, 2127386819, 2128942892, 2113925965}","{2653850944, 2127386819, 2572203943, 212894289...",4,"{2653850944, 2127386819, 2128942892, 2113925965}","{2137538457, 2149027090, 2572203943}",0,{}
4,4.195747e+07,2015.0,"{2034626565, 1848517639, 2597654027, 203201076...","{2152585733, 2142553607, 2110808072, 294896999...",32,"{2047324288, 2133711754, 1719173006, 219254094...","{2152585733, 2142553607, 2110808072, 294896999...",83,"{2034626565, 1848517639, 2597654027, 203201076..."
...,...,...,...,...,...,...,...,...,...
1757,3.173544e+09,2011.0,"{3120802076, 2236143686}","{2096029443, 2160112133, 2158121610, 230907623...",0,{},"{2096029443, 2160112133, 2158121610, 230907623...",2,"{3120802076, 2236143686}"
1758,3.174124e+09,2004.0,"{2706053123, 2250669861, 2568589385, 222924385...","{2939265617, 2687883010, 2424699715, 2100866894}",1,{2100866894},"{2939265617, 2687883010, 2424699715}",6,"{2706053123, 2250669861, 2568589385, 222924385..."
1759,3.174448e+09,2008.0,{2561941943},"{2413204075, 2100828844, 3175667245, 299211752...",1,{2561941943},"{2413204075, 2100828844, 3175667245, 299211752...",0,{}
1760,3.175436e+09,2015.0,"{1805786912, 2999619457, 2658197410, 257933920...","{2130470407, 2395301650, 1455333013, 231293572...",1,{2121913688},"{2240552385, 2130470407, 2333910471, 252033530...",14,"{1805786912, 2999619457, 2658197410, 196858720..."


In [102]:
# Let us first merge df3 and df4

df_A = df3.merge(df4, on=['MAGAID','RetractionYear'])

# Let us also create three flags checking whether current collaborator is retained, gained, or lost

df_A['CollabAIDinRetained'] = df_A.apply(lambda row: row['MAGCollabAID'] in row['CollabAIDRetainedW5'], 
                                          axis=1)

df_A['CollabAIDinGained'] = df_A.apply(lambda row: row['MAGCollabAID'] in row['CollabAIDGainedW5'], 
                                          axis=1)

df_A['CollabAIDinLost'] = df_A.apply(lambda row: row['MAGCollabAID'] in row['CollabAIDLostW5'], 
                                          axis=1)

df_A.head(3)

Unnamed: 0,MAGAID,MAGCollabAID,RetractionYear,MAGCollaborationYear,ScientistType,CollabGenderizeGender,CollabGenderizeConfidence,CollabMAGFirstPubYear,CollabMAGCumPapersYearAtRetraction,CollabMAGCumPapersAtRetraction,...,post5,pre,NumRetentionW5,CollabAIDRetainedW5,CollabAIDLostW5,NumNewCollaboratorsW5,CollabAIDGainedW5,CollabAIDinRetained,CollabAIDinGained,CollabAIDinLost
0,2033335000.0,1917877966,1995.0,1999.0,retracted,male,0.99,1994.0,1994.0,1.0,...,"{1973921793, 2041262087, 2570043915, 213721346...","{2041262087, 2590561305, 1940968476, 211428099...",23,"{2690057605, 2041262087, 2466141063, 230478619...","{2590561305, 1940968476, 2657016352, 265692164...",60,"{1973921793, 2308182914, 1147890307, 265602649...",False,True,False
1,2033335000.0,2169118091,1995.0,1999.0,retracted,male,1.0,1995.0,1995.0,1.0,...,"{1973921793, 2041262087, 2570043915, 213721346...","{2041262087, 2590561305, 1940968476, 211428099...",23,"{2690057605, 2041262087, 2466141063, 230478619...","{2590561305, 1940968476, 2657016352, 265692164...",60,"{1973921793, 2308182914, 1147890307, 265602649...",False,True,False
2,2033335000.0,275085591,1995.0,1997.0,retracted,male,0.99,1994.0,1995.0,5.0,...,"{1973921793, 2041262087, 2570043915, 213721346...","{2041262087, 2590561305, 1940968476, 211428099...",23,"{2690057605, 2041262087, 2466141063, 230478619...","{2590561305, 1940968476, 2657016352, 265692164...",60,"{1973921793, 2308182914, 1147890307, 265602649...",True,False,False


In [103]:
# Sanity check: each row should carry exactly one of the three flags

df_A[['CollabAIDinRetained','CollabAIDinGained','CollabAIDinLost']].value_counts()

CollabAIDinRetained  CollabAIDinGained  CollabAIDinLost
False                False              True               74864
                     True               False              62430
True                 False              False              47536
Name: count, dtype: int64
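The three counts above sum to 184,830, the number of rows in `df3`, and no row has more than one flag set. That property can also be asserted directly. A minimal sketch on a toy frame with the same flag columns (`toy_flags` is an illustrative name):

```python
import pandas as pd

# Toy stand-in for the three flag columns of df_A.
toy_flags = pd.DataFrame({
    "CollabAIDinRetained": [True, False, False],
    "CollabAIDinGained": [False, True, False],
    "CollabAIDinLost": [False, False, True],
})

# Booleans sum as 0/1, so a valid row sums to exactly 1.
flags_per_row = toy_flags.sum(axis=1)
exactly_one = bool(flags_per_row.eq(1).all())
print(exactly_one)
```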

In [104]:
df_A.columns, df_A.shape

(Index(['MAGAID', 'MAGCollabAID', 'RetractionYear', 'MAGCollaborationYear',
        'ScientistType', 'CollabGenderizeGender', 'CollabGenderizeConfidence',
        'CollabMAGFirstPubYear', 'CollabMAGCumPapersYearAtRetraction',
        'CollabMAGCumPapersAtRetraction',
        'CollabMAGCumCitationsYearAtRetraction',
        'CollabMAGCumCitationsAtRetraction',
        'CollabMAGCumCollaboratorsYearAtRetraction',
        'CollabMAGCumCollaboratorsAtRetraction',
        'CollabMAGCumPapersYearAtCollaboration',
        'CollabMAGCumPapersAtCollaboration',
        'CollabMAGCumCitationsYearAtCollaboration',
        'CollabMAGCumCitationsAtCollaboration',
        'CollabMAGCumCollaboratorsYearAtCollaboration',
        'CollabMAGCumCollaboratorsAtCollaboration',
        'MAGAIDFirstORLastAuthorFlag', 'AcademicAgeAtRetraction',
        'AuthorSeniorityAtRetraction', 'CollabAcademicAgeAtRetraction',
        'CollabAcademicAgeAtCollaboration', 'PrePostFlag5', 'post5', 'pre',
        'NumRetentio

## DANGER ZONE!

The cells below would remove collaborators whose academic age at the time of collaboration exceeds 70. Both are currently commented out, so no rows are dropped.

In [105]:
#df_A[df_A.CollabAcademicAgeAtCollaboration.gt(70) & df_A.ScientistType.eq('retracted')].MAGAID.nunique()

In [106]:
#df_A = df_A[df_A.CollabAcademicAgeAtCollaboration.le(70)]

### A1: Collaborators retained: retracted vs. matched

In [107]:
# Let us keep only the post-retraction collaborations (within 5 years) from df_A

df_A1_post = df_A[df_A['PrePostFlag5']=='post5']

In [108]:
# Now we shall group by MAGAID, MAGCollabAID, and RetractionYear,
# and extract the earliest post-retraction collaboration year in each group

df_A1_firstcollabs = df_A1_post.groupby(['MAGAID','MAGCollabAID','RetractionYear'])['MAGCollaborationYear']\
                        .min().reset_index()\
                        .rename(columns={'MAGCollaborationYear':'FirstPostRetractionMAGCollaborationYear'})


# Now we shall merge the new column with A1

df_A1_w_firstcollabs = df_A1_post.merge(df_A1_firstcollabs,
                                   on=['MAGAID','MAGCollabAID','RetractionYear'])

df_A1_w_firstcollabs.shape

(83316, 37)
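The group-then-merge pattern used here, followed by a filter on the first year, can also be written with `groupby(...).transform("min")`, which returns the group minimum aligned to the original rows. This is a sketch on made-up data, offered as an alternative rather than as the notebook's method; `toy_post` and `first_year` are illustrative names.

```python
import pandas as pd

# Made-up post-retraction collaborations of one author with two collaborators.
toy_post = pd.DataFrame({
    "MAGAID": [1, 1, 1, 1],
    "MAGCollabAID": [10, 10, 20, 20],
    "RetractionYear": [2002, 2002, 2002, 2002],
    "MAGCollaborationYear": [2003, 2004, 2004, 2006],
})

# Earliest post-retraction collaboration year per (author, collaborator) pair,
# broadcast back to every row of the pair.
first_year = toy_post.groupby(
    ["MAGAID", "MAGCollabAID", "RetractionYear"]
)["MAGCollaborationYear"].transform("min")

# Keep only the rows of the first post-retraction collaboration.
toy_first_only = toy_post[toy_post["MAGCollaborationYear"] == first_year]
print(toy_first_only)
```

This avoids the intermediate merge. Rows with a missing grouping key get a missing minimum and are therefore dropped by the filter.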

In [109]:
# Sanity check: the first post-retraction year should be constant within each (MAGAID, MAGCollabAID) pair

df_A1_w_firstcollabs.sort_values(by=['MAGAID','MAGCollabAID','MAGCollaborationYear'])\
            [['MAGAID','MAGCollabAID','MAGCollaborationYear','FirstPostRetractionMAGCollaborationYear']].head(30)

Unnamed: 0,MAGAID,MAGCollabAID,MAGCollaborationYear,FirstPostRetractionMAGCollaborationYear
3472,19100288.0,18011520,2004.0,2004.0
3479,19100288.0,121410733,2004.0,2004.0
3460,19100288.0,410625722,2005.0,2005.0
3473,19100288.0,1235268530,2004.0,2004.0
3482,19100288.0,1340583028,2006.0,2006.0
3454,19100288.0,1793107545,2003.0,2003.0
3455,19100288.0,1793107545,2004.0,2003.0
3474,19100288.0,1859744180,2004.0,2004.0
3456,19100288.0,1863203661,2003.0,2003.0
3457,19100288.0,1863203661,2004.0,2003.0


In [110]:
# Now let us only extract rows where collaboration year is the first collaboration year

df_A1_w_firstcollabs_only = df_A1_w_firstcollabs[df_A1_w_firstcollabs.MAGCollaborationYear == \
                                                df_A1_w_firstcollabs.FirstPostRetractionMAGCollaborationYear]

df_A1_w_firstcollabs_only.shape

(63167, 37)

In [111]:
df_A1_w_firstcollabs_only.columns

Index(['MAGAID', 'MAGCollabAID', 'RetractionYear', 'MAGCollaborationYear',
       'ScientistType', 'CollabGenderizeGender', 'CollabGenderizeConfidence',
       'CollabMAGFirstPubYear', 'CollabMAGCumPapersYearAtRetraction',
       'CollabMAGCumPapersAtRetraction',
       'CollabMAGCumCitationsYearAtRetraction',
       'CollabMAGCumCitationsAtRetraction',
       'CollabMAGCumCollaboratorsYearAtRetraction',
       'CollabMAGCumCollaboratorsAtRetraction',
       'CollabMAGCumPapersYearAtCollaboration',
       'CollabMAGCumPapersAtCollaboration',
       'CollabMAGCumCitationsYearAtCollaboration',
       'CollabMAGCumCitationsAtCollaboration',
       'CollabMAGCumCollaboratorsYearAtCollaboration',
       'CollabMAGCumCollaboratorsAtCollaboration',
       'MAGAIDFirstORLastAuthorFlag', 'AcademicAgeAtRetraction',
       'AuthorSeniorityAtRetraction', 'CollabAcademicAgeAtRetraction',
       'CollabAcademicAgeAtCollaboration', 'PrePostFlag5', 'post5', 'pre',
       'NumRetentionW5', 'CollabAIDRe

In [112]:
def create_stratified_dfs_retention(dfi):
    
    # This function creates the 6 dataframes needed for our analysis:
    # 3 hold the relevant columns for treatment (retracted) scientists,
    # and the other 3 hold them for control (matched) scientists.
    # Each group of 3 is stratified by seniority: junior, mid-career, and senior
    df_ids = pd.read_csv(INDIR_MATCHING+"/RWMAG_rematched_control_augmented_rematching_30perc.csv",
                    usecols=['MAGAID','MatchMAGAID', 'RetractionYear']).drop_duplicates()
    
    rel_cols = ['MAGAID', 'ScientistType','MAGCollabAID', 'RetractionYear', 'MAGCollaborationYear',
               'CollabMAGCumPapersAtCollaboration', 'CollabMAGCumCitationsAtCollaboration',
               'CollabMAGCumCollaboratorsAtCollaboration', 'AuthorSeniorityAtRetraction',
               'CollabAcademicAgeAtCollaboration', 'CollabAIDinRetained', 'NumRetentionW5']
    
    # Only extracting relevant cols
    dfi = dfi[rel_cols].drop_duplicates()
    
    # Only extract those collaborators that were retained
    dfi = dfi[dfi['CollabAIDinRetained']]
    
    # Dividing into retracted and matched
    df_retracted = dfi[dfi.ScientistType == 'retracted']
    df_nonretracted = dfi[dfi.ScientistType == 'matched']\
                        .rename(columns={'MAGAID':'MatchMAGAID'})\
                        .merge(df_ids, on=['MatchMAGAID','RetractionYear'])
    
    # We also need to make sure that the retracted scientists have matches with a nonzero number of retained collaborators
    df_retracted = df_retracted[df_retracted.MAGAID.isin(df_nonretracted.MAGAID.unique())]
    
    # We need to make sure that the matches of those who retained 0 collaborators are removed
    df_nonretracted = df_nonretracted[df_nonretracted.MAGAID.isin(df_retracted.MAGAID.unique())]
    
    # Dividing into seniority for retracted
    df_retracted_junior = df_retracted[df_retracted.AuthorSeniorityAtRetraction=='early-career author']
    df_retracted_midcareer = df_retracted[df_retracted.AuthorSeniorityAtRetraction=='mid-career author']
    df_retracted_senior = df_retracted[df_retracted.AuthorSeniorityAtRetraction=='senior author']
    # and matched
    df_nonretracted_junior = df_nonretracted[df_nonretracted.AuthorSeniorityAtRetraction=='early-career author']
    df_nonretracted_midcareer = df_nonretracted[df_nonretracted.AuthorSeniorityAtRetraction=='mid-career author']
    df_nonretracted_senior = df_nonretracted[df_nonretracted.AuthorSeniorityAtRetraction=='senior author']
    
    return df_retracted_junior, df_retracted_midcareer, df_retracted_senior, df_nonretracted_junior, df_nonretracted_midcareer, df_nonretracted_senior


In [113]:
df_rj, df_rm, df_rs, df_nrj, df_nrm, df_nrs = create_stratified_dfs_retention(df_A1_w_firstcollabs_only)

In [114]:
df_rj.MAGAID.nunique(), df_rm.MAGAID.nunique(), df_rs.MAGAID.nunique(), df_nrj.MAGAID.nunique(), df_nrm.MAGAID.nunique(), df_nrs.MAGAID.nunique()

(162, 86, 152, 162, 86, 152)
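
The retracted and matched counts above agree in every seniority group because the two `isin` filters in `create_stratified_dfs_retention` are applied one after the other. A minimal sketch of that two-way filter on toy data follows; the column names mirror the notebook, while the IDs are made up for illustration.

```python
import pandas as pd

# Toy data: IDs are made up; MAGAID in df_matched is the retracted scientist
# that each matched scientist (MatchMAGAID) belongs to.
df_retracted = pd.DataFrame({"MAGAID": [1, 1, 2, 3]})
df_matched = pd.DataFrame({"MAGAID": [1, 2, 2, 4],
                           "MatchMAGAID": [10, 20, 21, 40]})

# Keep only retracted scientists that have at least one matched row ...
df_retracted = df_retracted[df_retracted.MAGAID.isin(df_matched.MAGAID.unique())]
# ... and only matches whose retracted scientist survived that filter.
df_matched = df_matched[df_matched.MAGAID.isin(df_retracted.MAGAID.unique())]

ids_retracted = sorted(df_retracted.MAGAID.unique().tolist())
ids_matched = sorted(df_matched.MAGAID.unique().tolist())
print(ids_retracted, ids_matched)
```

After both filters, the two frames cover the same set of retracted scientists, so `nunique()` on `MAGAID` must agree on both sides.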

In [115]:
# Let us extract the mean dataframes and merge them for different age categories

def get_mean_df_retention(dfr, dfnr):
    mean_dfr = dfr.groupby('MAGAID')[['CollabAcademicAgeAtCollaboration',
                      'CollabMAGCumPapersAtCollaboration',
                      'CollabMAGCumCitationsAtCollaboration',
                      'CollabMAGCumCollaboratorsAtCollaboration']].mean()
    
    mean_dfnr = dfnr.groupby(['MAGAID','MatchMAGAID'])[['CollabAcademicAgeAtCollaboration',
                      'CollabMAGCumPapersAtCollaboration',
                      'CollabMAGCumCitationsAtCollaboration',
                      'CollabMAGCumCollaboratorsAtCollaboration']].mean()\
                    .groupby('MAGAID').mean()\
                    .rename(columns={'CollabMAGCumPapersAtCollaboration':'MatchCollabMAGCumPapersAtCollaboration',
                                    'CollabAcademicAgeAtCollaboration':'MatchCollabAcademicAgeAtCollaboration',
                                    'CollabMAGCumCitationsAtCollaboration': 'MatchCollabMAGCumCitationsAtCollaboration',
                                    'CollabMAGCumCollaboratorsAtCollaboration': 'MatchCollabMAGCumCollaboratorsAtCollaboration'})
    
    return mean_dfr, mean_dfnr

def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), stats.sem(a)
    h = se * stats.t.ppf((1 + confidence) / 2., n-1)
    return m, m-h, m+h

def get_stats(dfr, dfnr, column):
    """
    Compute the mean, median, standard deviation, and Welch t-test p-value for the given column,
    plus the mean per-scientist difference (retracted minus matched) and its 95% CI.
    """
    
    mean_r = dfr[column].mean()
    median_r = dfr[column].median()
    std_r = dfr[column].std()
    
    mean_nr = dfnr['Match'+column].mean()
    median_nr = dfnr['Match'+column].median()
    std_nr = dfnr['Match'+column].std()
    
    _, pval = stats.ttest_ind(dfr[column],dfnr['Match'+column], equal_var=False)
    
    lst_delta = (dfr[column]- dfnr['Match'+column]).tolist()
    delta_mean, conf_lower, conf_upper = mean_confidence_interval(lst_delta, confidence=0.95)
    
    return {column+'_retracted_mean':round(mean_r,2), 
            column+'_retracted_median':round(median_r,2), 
            column+'_retracted_std':round(std_r,2), 
            column+'_nonretracted_mean':round(mean_nr,2), 
            column+'_nonretracted_median':round(median_nr,2), 
            column+'_nonretracted_std':round(std_nr,2), 
            column+'_delta_mean':round(delta_mean,2), 
            column+'_pval_welch':round(pval,3), 
            column+'_CI_95lower':round(conf_lower,2), 
            column+'_CI_95upper':round(conf_upper,2)}
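
Note that `get_stats` mixes two views of the data: the Welch p-value treats the retracted and matched columns as independent samples, while the confidence interval is built on per-scientist differences (the subtraction aligns on the `MAGAID` index). The sketch below shows both on toy data. The values are made up, and the paired t-test at the end is only included for comparison; it is not part of the notebook's analysis.

```python
import pandas as pd
from scipy import stats

# Hypothetical per-scientist means indexed by MAGAID (values are made up).
retracted = pd.Series([12.0, 9.5, 15.0, 7.0, 11.0], index=[1, 2, 3, 4, 5])
matched = pd.Series([14.0, 10.0, 13.5, 9.0, 12.5], index=[1, 2, 3, 4, 5])

# Welch's t-test: the two columns are treated as independent samples.
_, p_welch = stats.ttest_ind(retracted, matched, equal_var=False)

# Paired view: one difference per scientist, aligned on the index.
delta = (retracted - matched).to_numpy()
n = len(delta)
half_width = stats.sem(delta) * stats.t.ppf((1 + 0.95) / 2.0, n - 1)
delta_mean = delta.mean()
ci_lower, ci_upper = delta_mean - half_width, delta_mean + half_width

# For comparison only: the paired t-test is the test that corresponds
# to a t-interval on the differences.
_, p_paired = stats.ttest_rel(retracted, matched)

print(round(p_welch, 3), round(p_paired, 3), round(ci_lower, 2), round(ci_upper, 2))
```

Because the p-value and the interval come from different tests, a 95% CI that excludes zero need not coincide with a Welch p-value below 0.05.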

In [116]:
# Now let us do the comparison

# Let us first get the mean dataframes

mean_dfrj, mean_dfnrj = get_mean_df_retention(df_rj, df_nrj)
mean_dfrm, mean_dfnrm = get_mean_df_retention(df_rm, df_nrm)
mean_dfrs, mean_dfnrs = get_mean_df_retention(df_rs, df_nrs)

mean_dfnrj

MAGAID,MatchCollabAcademicAgeAtCollaboration,MatchCollabMAGCumPapersAtCollaboration,MatchCollabMAGCumCitationsAtCollaboration,MatchCollabMAGCumCollaboratorsAtCollaboration
33433812,15.333333,91.000000,924.000000,76.333333
94287040,7.600000,120.400000,1556.800000,142.200000
183936737,10.666667,25.333333,1090.000000,75.333333
206603143,13.270833,104.308333,1874.112500,347.070833
298061212,17.200000,154.400000,1413.000000,755.800000
...,...,...,...,...
2987934803,16.722222,73.972222,497.777778,158.555556
2996120756,11.962500,60.300000,1022.666667,147.112500
3049602676,21.000000,87.000000,285.000000,24.000000
3052744139,8.812500,110.187500,898.125000,123.250000


In [117]:
exp_fields = ['CollabAcademicAgeAtCollaboration',
                      'CollabMAGCumPapersAtCollaboration',
                      'CollabMAGCumCitationsAtCollaboration',
                      'CollabMAGCumCollaboratorsAtCollaboration']

# Now we should compute outcome variables for each of the four experience variables.

lst_dicts_retention = []

for exp_field in exp_fields:
    dicts_retention = {}
    
    dict_stats_j = get_stats(mean_dfrj, mean_dfnrj, exp_field)
    dict_stats_m = get_stats(mean_dfrm, mean_dfnrm, exp_field)
    dict_stats_s = get_stats(mean_dfrs, mean_dfnrs, exp_field)
    
    dicts_retention['Junior'] = dict_stats_j
    dicts_retention['Mid'] = dict_stats_m
    dicts_retention['Senior'] = dict_stats_s
    
    lst_dicts_retention.append(dicts_retention)

In [118]:
pd.DataFrame(lst_dicts_retention[0])

Statistic,Junior,Mid,Senior
CollabAcademicAgeAtCollaboration_retracted_mean,12.61,14.63,14.7
CollabAcademicAgeAtCollaboration_retracted_median,12.0,14.66,15.27
CollabAcademicAgeAtCollaboration_retracted_std,8.06,6.31,6.69
CollabAcademicAgeAtCollaboration_nonretracted_mean,14.48,15.86,16.44
CollabAcademicAgeAtCollaboration_nonretracted_median,13.17,14.67,14.77
CollabAcademicAgeAtCollaboration_nonretracted_std,7.03,7.14,7.78
CollabAcademicAgeAtCollaboration_delta_mean,-1.87,-1.23,-1.74
CollabAcademicAgeAtCollaboration_pval_welch,0.026,0.232,0.038
CollabAcademicAgeAtCollaboration_CI_95lower,-3.35,-2.85,-3.25
CollabAcademicAgeAtCollaboration_CI_95upper,-0.4,0.39,-0.23


In [119]:
pd.DataFrame(lst_dicts_retention[1])

Statistic,Junior,Mid,Senior
CollabMAGCumPapersAtCollaboration_retracted_mean,68.22,76.44,61.64
CollabMAGCumPapersAtCollaboration_retracted_median,44.92,63.35,46.17
CollabMAGCumPapersAtCollaboration_retracted_std,81.33,81.4,71.74
CollabMAGCumPapersAtCollaboration_nonretracted_mean,80.12,78.95,65.82
CollabMAGCumPapersAtCollaboration_nonretracted_median,61.61,67.75,54.28
CollabMAGCumPapersAtCollaboration_nonretracted_std,120.23,60.06,60.07
CollabMAGCumPapersAtCollaboration_delta_mean,-11.9,-2.5,-4.19
CollabMAGCumPapersAtCollaboration_pval_welch,0.298,0.819,0.582
CollabMAGCumPapersAtCollaboration_CI_95lower,-34.43,-23.91,-19.38
CollabMAGCumPapersAtCollaboration_CI_95upper,10.62,18.91,11.01


In [120]:
pd.DataFrame(lst_dicts_retention[2])

Statistic,Junior,Mid,Senior
CollabMAGCumCitationsAtCollaboration_retracted_mean,1375.15,1451.05,1600.02
CollabMAGCumCitationsAtCollaboration_retracted_median,405.5,650.21,551.75
CollabMAGCumCitationsAtCollaboration_retracted_std,2403.44,2106.28,6538.35
CollabMAGCumCitationsAtCollaboration_nonretracted_mean,1356.35,1530.51,1242.4
CollabMAGCumCitationsAtCollaboration_nonretracted_median,690.48,846.9,496.91
CollabMAGCumCitationsAtCollaboration_nonretracted_std,2306.96,2496.01,1802.62
CollabMAGCumCitationsAtCollaboration_delta_mean,18.8,-79.46,357.62
CollabMAGCumCitationsAtCollaboration_pval_welch,0.943,0.822,0.517
CollabMAGCumCitationsAtCollaboration_CI_95lower,-467.11,-763.37,-712.49
CollabMAGCumCitationsAtCollaboration_CI_95upper,504.71,604.45,1427.72


In [121]:
pd.DataFrame(lst_dicts_retention[3])

Statistic,Junior,Mid,Senior
CollabMAGCumCollaboratorsAtCollaboration_retracted_mean,145.93,179.87,167.2
CollabMAGCumCollaboratorsAtCollaboration_retracted_median,77.5,119.62,83.56
CollabMAGCumCollaboratorsAtCollaboration_retracted_std,234.2,176.47,220.15
CollabMAGCumCollaboratorsAtCollaboration_nonretracted_mean,203.74,179.74,143.86
CollabMAGCumCollaboratorsAtCollaboration_nonretracted_median,100.2,121.23,104.25
CollabMAGCumCollaboratorsAtCollaboration_nonretracted_std,486.63,181.8,144.35
CollabMAGCumCollaboratorsAtCollaboration_delta_mean,-57.81,0.12,23.34
CollabMAGCumCollaboratorsAtCollaboration_pval_welch,0.174,0.996,0.275
CollabMAGCumCollaboratorsAtCollaboration_CI_95lower,-143.7,-50.63,-16.77
CollabMAGCumCollaboratorsAtCollaboration_CI_95upper,28.08,50.88,63.45


### A2: Collaborators gained: retracted vs. matched

In [122]:
# Let us now filter df_A such that we remove all rows with pre-retraction collaborations

df_A2_post = df_A[df_A['PrePostFlag5']=='post5']

In [123]:
# Now we shall group by MAGAID, MAGCollabAID, and RetractionYear,
# and extract the earliest post-retraction collaboration year for each group

df_A2_firstcollabs = df_A2_post.groupby(['MAGAID','MAGCollabAID','RetractionYear'])['MAGCollaborationYear']\
                        .min().reset_index()\
                        .rename(columns={'MAGCollaborationYear':'FirstPostRetractionMAGCollaborationYear'})


# Now we shall merge the new column back into the post-retraction rows of A2

df_A2_w_firstcollabs = df_A2_post.merge(df_A2_firstcollabs,
                                   on=['MAGAID','MAGCollabAID','RetractionYear'])

df_A2_w_firstcollabs.shape

(83316, 37)

In [124]:
# Sanity checks

df_A2_w_firstcollabs.sort_values(by=['MAGAID','MAGCollabAID','MAGCollaborationYear'])\
            [['MAGAID','MAGCollabAID','MAGCollaborationYear','FirstPostRetractionMAGCollaborationYear']].head(30)

Unnamed: 0,MAGAID,MAGCollabAID,MAGCollaborationYear,FirstPostRetractionMAGCollaborationYear
3472,19100288.0,18011520,2004.0,2004.0
3479,19100288.0,121410733,2004.0,2004.0
3460,19100288.0,410625722,2005.0,2005.0
3473,19100288.0,1235268530,2004.0,2004.0
3482,19100288.0,1340583028,2006.0,2006.0
3454,19100288.0,1793107545,2003.0,2003.0
3455,19100288.0,1793107545,2004.0,2003.0
3474,19100288.0,1859744180,2004.0,2004.0
3456,19100288.0,1863203661,2003.0,2003.0
3457,19100288.0,1863203661,2004.0,2003.0


In [125]:
# Now let us only extract rows where collaboration year is the first collaboration year

df_A2_w_firstcollabs_only = df_A2_w_firstcollabs[df_A2_w_firstcollabs.MAGCollaborationYear == \
                                                df_A2_w_firstcollabs.FirstPostRetractionMAGCollaborationYear]

df_A2_w_firstcollabs_only.shape

(63167, 37)
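
The groupby, merge, and filter steps above can also be written with `transform`, which broadcasts each group's minimum back onto its rows. The following is a minimal sketch on toy data (values made up), not a replacement for the cells above.

```python
import pandas as pd

# Toy post-retraction collaboration records.
df_post = pd.DataFrame({
    "MAGAID":               [1, 1, 1, 1],
    "MAGCollabAID":         [7, 7, 8, 8],
    "RetractionYear":       [2002, 2002, 2002, 2002],
    "MAGCollaborationYear": [2003, 2004, 2004, 2004],
})
keys = ["MAGAID", "MAGCollabAID", "RetractionYear"]

# Earliest collaboration year per (scientist, collaborator, retraction year),
# repeated on every row of the group.
first_year = df_post.groupby(keys)["MAGCollaborationYear"].transform("min")

# Keep only rows from the first post-retraction collaboration year.
df_first_only = df_post[df_post["MAGCollaborationYear"] == first_year]

print(df_first_only.index.tolist())
```

As in the notebook, a collaborator with several records in the first year keeps all of them, which is why the later `drop_duplicates()` on the relevant columns is still needed.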

In [126]:
df_A2_w_firstcollabs_only.columns

Index(['MAGAID', 'MAGCollabAID', 'RetractionYear', 'MAGCollaborationYear',
       'ScientistType', 'CollabGenderizeGender', 'CollabGenderizeConfidence',
       'CollabMAGFirstPubYear', 'CollabMAGCumPapersYearAtRetraction',
       'CollabMAGCumPapersAtRetraction',
       'CollabMAGCumCitationsYearAtRetraction',
       'CollabMAGCumCitationsAtRetraction',
       'CollabMAGCumCollaboratorsYearAtRetraction',
       'CollabMAGCumCollaboratorsAtRetraction',
       'CollabMAGCumPapersYearAtCollaboration',
       'CollabMAGCumPapersAtCollaboration',
       'CollabMAGCumCitationsYearAtCollaboration',
       'CollabMAGCumCitationsAtCollaboration',
       'CollabMAGCumCollaboratorsYearAtCollaboration',
       'CollabMAGCumCollaboratorsAtCollaboration',
       'MAGAIDFirstORLastAuthorFlag', 'AcademicAgeAtRetraction',
       'AuthorSeniorityAtRetraction', 'CollabAcademicAgeAtRetraction',
       'CollabAcademicAgeAtCollaboration', 'PrePostFlag5', 'post5', 'pre',
       'NumRetentionW5', 'CollabAIDRe

In [127]:
def create_stratified_dfs_gain(dfi):
    
    # This function will create 6 dataframes relevant for conducting our analysis
    # 3 of those dataframes will be for relevant columns for treatment
    # rest 3 will be average control. 
    # These will be stratified by treatment and control, and further stratified by seniority
    df_ids = pd.read_csv(INDIR_MATCHING+"/RWMAG_rematched_control_augmented_rematching_30perc.csv",
                    usecols=['MAGAID','MatchMAGAID', 'RetractionYear']).drop_duplicates()
    
    rel_cols = ['MAGAID', 'ScientistType','MAGCollabAID', 'RetractionYear', 'MAGCollaborationYear',
               'CollabMAGCumPapersAtCollaboration', 'CollabMAGCumCitationsAtCollaboration',
               'CollabMAGCumCollaboratorsAtCollaboration', 'AuthorSeniorityAtRetraction',
               'CollabAcademicAgeAtCollaboration', 'CollabAIDinGained', 'NumNewCollaboratorsW5']
    
    # Only extracting relevant cols
    dfi = dfi[rel_cols].drop_duplicates()
    
    # Only extract those collaborators that were gained
    dfi = dfi[dfi['CollabAIDinGained']]
    
    # Dividing into retracted and matched
    df_retracted = dfi[dfi.ScientistType == 'retracted']
    df_nonretracted = dfi[dfi.ScientistType == 'matched']\
                        .rename(columns={'MAGAID':'MatchMAGAID'})\
                        .merge(df_ids, on=['MatchMAGAID','RetractionYear'])
    
    # We also need to make sure that the retracted scientists have matches with non-zero collaborators
    df_retracted = df_retracted[df_retracted.MAGAID.isin(df_nonretracted.MAGAID.unique())]
    
    # We need to make sure that the matches of those who gained 0 collaborators are removed
    df_nonretracted = df_nonretracted[df_nonretracted.MAGAID.isin(df_retracted.MAGAID.unique())]
    
    # Dividing into seniority for retracted
    df_retracted_junior = df_retracted[df_retracted.AuthorSeniorityAtRetraction=='early-career author']
    df_retracted_midcareer = df_retracted[df_retracted.AuthorSeniorityAtRetraction=='mid-career author']
    df_retracted_senior = df_retracted[df_retracted.AuthorSeniorityAtRetraction=='senior author']
    # and matched
    df_nonretracted_junior = df_nonretracted[df_nonretracted.AuthorSeniorityAtRetraction=='early-career author']
    df_nonretracted_midcareer = df_nonretracted[df_nonretracted.AuthorSeniorityAtRetraction=='mid-career author']
    df_nonretracted_senior = df_nonretracted[df_nonretracted.AuthorSeniorityAtRetraction=='senior author']
    
    return df_retracted_junior, df_retracted_midcareer, df_retracted_senior,df_nonretracted_junior, df_nonretracted_midcareer, df_nonretracted_senior
    

In [128]:
df_rj, df_rm, df_rs, df_nrj, df_nrm, df_nrs = create_stratified_dfs_gain(df_A2_w_firstcollabs_only)

In [129]:
df_rj.MAGAID.nunique(), df_rm.MAGAID.nunique(), df_rs.MAGAID.nunique(), df_nrj.MAGAID.nunique(), df_nrm.MAGAID.nunique(), df_nrs.MAGAID.nunique()

(169, 97, 145, 169, 97, 145)

In [130]:
# Let us extract the mean dataframes and merge them for different age categories

def get_mean_df_gain(dfr, dfnr):
    mean_dfr = dfr.groupby('MAGAID')[['CollabAcademicAgeAtCollaboration',
                      'CollabMAGCumPapersAtCollaboration',
                      'CollabMAGCumCitationsAtCollaboration',
                      'CollabMAGCumCollaboratorsAtCollaboration']].mean()
    
    mean_dfnr = dfnr.groupby(['MAGAID','MatchMAGAID'])[['CollabAcademicAgeAtCollaboration',
                      'CollabMAGCumPapersAtCollaboration',
                      'CollabMAGCumCitationsAtCollaboration',
                      'CollabMAGCumCollaboratorsAtCollaboration']].mean()\
                    .groupby('MAGAID').mean()\
                    .rename(columns={'CollabMAGCumPapersAtCollaboration':'MatchCollabMAGCumPapersAtCollaboration',
                                    'CollabAcademicAgeAtCollaboration':'MatchCollabAcademicAgeAtCollaboration',
                                    'CollabMAGCumCitationsAtCollaboration': 'MatchCollabMAGCumCitationsAtCollaboration',
                                    'CollabMAGCumCollaboratorsAtCollaboration': 'MatchCollabMAGCumCollaboratorsAtCollaboration'})
    
    return mean_dfr, mean_dfnr

def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), stats.sem(a)
    h = se * stats.t.ppf((1 + confidence) / 2., n-1)
    return m, m-h, m+h

def get_stats(dfr, dfnr, column):
    """
    Compute the mean, median, standard deviation, and Welch t-test p-value for the given column,
    plus the mean per-scientist difference (retracted minus matched) and its 95% CI.
    """
    
    mean_r = dfr[column].mean()
    median_r = dfr[column].median()
    std_r = dfr[column].std()
    
    mean_nr = dfnr['Match'+column].mean()
    median_nr = dfnr['Match'+column].median()
    std_nr = dfnr['Match'+column].std()
    
    _, pval = stats.ttest_ind(dfr[column],dfnr['Match'+column], equal_var=False)
    
    lst_delta = (dfr[column]- dfnr['Match'+column]).tolist()
    delta_mean, conf_lower, conf_upper = mean_confidence_interval(lst_delta, confidence=0.95)
    
    return {column+'_retracted_mean':round(mean_r,2), 
            column+'_retracted_median':round(median_r,2), 
            column+'_retracted_std':round(std_r,2), 
            column+'_nonretracted_mean':round(mean_nr,2), 
            column+'_nonretracted_median':round(median_nr,2), 
            column+'_nonretracted_std':round(std_nr,2), 
            column+'_delta_mean':round(delta_mean,2), 
            column+'_pval_welch':round(pval,3), 
            column+'_CI_95lower':round(conf_lower,2), 
            column+'_CI_95upper':round(conf_upper,2)}

In [131]:
# Now let us do the comparison

# Let us first get the mean dataframes

mean_dfrj, mean_dfnrj = get_mean_df_gain(df_rj, df_nrj)
mean_dfrm, mean_dfnrm = get_mean_df_gain(df_rm, df_nrm)
mean_dfrs, mean_dfnrs = get_mean_df_gain(df_rs, df_nrs)

mean_dfnrj

MAGAID,MatchCollabAcademicAgeAtCollaboration,MatchCollabMAGCumPapersAtCollaboration,MatchCollabMAGCumCitationsAtCollaboration,MatchCollabMAGCumCollaboratorsAtCollaboration
21686935,43.000000,97.000000,1729.000000,223.000000
94287040,6.814815,52.925926,732.629630,54.555556
183936737,6.500000,23.681818,1075.681818,99.727273
206603143,10.027729,74.910030,1431.565466,259.011772
291910480,1.000000,2.250000,3.000000,4.500000
...,...,...,...,...
2987934803,7.289866,23.298474,223.897741,79.404396
2996120756,3.067666,12.751142,517.052081,36.953806
3037812393,11.000000,136.000000,12688.000000,481.000000
3049602676,0.000000,1.000000,0.000000,3.000000


In [132]:
exp_fields = ['CollabAcademicAgeAtCollaboration',
                      'CollabMAGCumPapersAtCollaboration',
                      'CollabMAGCumCitationsAtCollaboration',
                      'CollabMAGCumCollaboratorsAtCollaboration']

# Now we should compute outcome variables for each of the four experience variables.

lst_dicts_gain = []

for exp_field in exp_fields:
    dicts_gain = {}
    
    dict_stats_j = get_stats(mean_dfrj, mean_dfnrj, exp_field)
    dict_stats_m = get_stats(mean_dfrm, mean_dfnrm, exp_field)
    dict_stats_s = get_stats(mean_dfrs, mean_dfnrs, exp_field)
    
    dicts_gain['Junior'] = dict_stats_j
    dicts_gain['Mid'] = dict_stats_m
    dicts_gain['Senior'] = dict_stats_s
    
    lst_dicts_gain.append(dicts_gain)

In [133]:
pd.DataFrame(lst_dicts_gain[0])

Statistic,Junior,Mid,Senior
CollabAcademicAgeAtCollaboration_retracted_mean,7.07,8.15,8.26
CollabAcademicAgeAtCollaboration_retracted_median,5.14,8.0,7.86
CollabAcademicAgeAtCollaboration_retracted_std,7.56,5.08,4.73
CollabAcademicAgeAtCollaboration_nonretracted_mean,7.73,7.81,8.47
CollabAcademicAgeAtCollaboration_nonretracted_median,7.5,7.32,7.74
CollabAcademicAgeAtCollaboration_nonretracted_std,5.34,4.73,4.73
CollabAcademicAgeAtCollaboration_delta_mean,-0.66,0.34,-0.21
CollabAcademicAgeAtCollaboration_pval_welch,0.355,0.63,0.712
CollabAcademicAgeAtCollaboration_CI_95lower,-1.84,-0.96,-1.23
CollabAcademicAgeAtCollaboration_CI_95upper,0.52,1.65,0.82


In [134]:
pd.DataFrame(lst_dicts_gain[1])

Statistic,Junior,Mid,Senior
CollabMAGCumPapersAtCollaboration_retracted_mean,34.08,40.45,35.27
CollabMAGCumPapersAtCollaboration_retracted_median,17.75,30.53,29.43
CollabMAGCumPapersAtCollaboration_retracted_std,45.04,33.89,27.02
CollabMAGCumPapersAtCollaboration_nonretracted_mean,36.42,32.42,34.86
CollabMAGCumPapersAtCollaboration_nonretracted_median,29.4,28.15,29.84
CollabMAGCumPapersAtCollaboration_nonretracted_std,32.08,25.92,28.91
CollabMAGCumPapersAtCollaboration_delta_mean,-2.34,8.02,0.4
CollabMAGCumPapersAtCollaboration_pval_welch,0.583,0.066,0.902
CollabMAGCumPapersAtCollaboration_CI_95lower,-10.2,-0.16,-5.8
CollabMAGCumPapersAtCollaboration_CI_95upper,5.53,16.2,6.61


In [135]:
pd.DataFrame(lst_dicts_gain[2])

Statistic,Junior,Mid,Senior
CollabMAGCumCitationsAtCollaboration_retracted_mean,826.74,779.37,759.34
CollabMAGCumCitationsAtCollaboration_retracted_median,147.57,408.0,493.76
CollabMAGCumCitationsAtCollaboration_retracted_std,1988.28,942.5,869.04
CollabMAGCumCitationsAtCollaboration_nonretracted_mean,855.75,585.2,712.43
CollabMAGCumCitationsAtCollaboration_nonretracted_median,386.71,303.38,361.32
CollabMAGCumCitationsAtCollaboration_nonretracted_std,1426.11,798.07,1313.15
CollabMAGCumCitationsAtCollaboration_delta_mean,-29.01,194.17,46.91
CollabMAGCumCitationsAtCollaboration_pval_welch,0.878,0.123,0.72
CollabMAGCumCitationsAtCollaboration_CI_95lower,-387.28,-42.57,-199.63
CollabMAGCumCitationsAtCollaboration_CI_95upper,329.27,430.92,293.46


In [136]:
pd.DataFrame(lst_dicts_gain[3])

Statistic,Junior,Mid,Senior
CollabMAGCumCollaboratorsAtCollaboration_retracted_mean,95.85,133.36,157.23
CollabMAGCumCollaboratorsAtCollaboration_retracted_median,38.08,74.81,72.32
CollabMAGCumCollaboratorsAtCollaboration_retracted_std,155.45,180.11,295.06
CollabMAGCumCollaboratorsAtCollaboration_nonretracted_mean,110.23,92.45,109.44
CollabMAGCumCollaboratorsAtCollaboration_nonretracted_median,63.97,57.6,72.5
CollabMAGCumCollaboratorsAtCollaboration_nonretracted_std,144.51,154.97,139.22
CollabMAGCumCollaboratorsAtCollaboration_delta_mean,-14.38,40.91,47.79
CollabMAGCumCollaboratorsAtCollaboration_pval_welch,0.379,0.092,0.079
CollabMAGCumCollaboratorsAtCollaboration_CI_95lower,-45.05,-2.98,-3.55
CollabMAGCumCollaboratorsAtCollaboration_CI_95upper,16.3,84.8,99.12


In [137]:
def create_latex_for_filling(dicto, col):
    
    def create_string(metric):
        string = ""
        if metric == 'pval_welch':
            string = "& " + \
                str(dicto.get('Junior').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Junior').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Mid').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Mid').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Senior').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Senior').get(col+"_"+metric)) + \
                "\\\\ \n"
        else:
            string = "& " + \
                    str(dicto.get('Junior').get(col+"_retracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Junior').get(col+"_nonretracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Mid').get(col+"_retracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Mid').get(col+"_nonretracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Senior').get(col+"_retracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Senior').get(col+"_nonretracted_"+metric)) + \
                    "\\\\ \n"
        
        
        
        return string
    
    print(create_string("mean"))
    print(create_string("median"))
    print(create_string("std"))
    print(create_string("pval_welch"))
    
# pd.DataFrame(lst_dicts_retention[0])
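
The row construction in `create_latex_for_filling` can be written more compactly with a loop and `str.join`. The sketch below is one possible shape; the helper name `latex_row` is hypothetical, and the demo values are the academic-age means from the A1 table earlier in this notebook.

```python
def latex_row(stats_by_group, col, metric):
    """Build one LaTeX table row with retracted/matched cells per age group."""
    cells = []
    for group in ("Junior", "Mid", "Senior"):
        group_stats = stats_by_group[group]
        if metric == "pval_welch":
            # One p-value per group, repeated so the row keeps six cells.
            value = group_stats[f"{col}_{metric}"]
            cells += [value, value]
        else:
            cells += [group_stats[f"{col}_retracted_{metric}"],
                      group_stats[f"{col}_nonretracted_{metric}"]]
    return "& " + " & ".join(str(c) for c in cells) + "\\\\ \n"


col = "CollabAcademicAgeAtCollaboration"
demo = {
    "Junior": {col + "_retracted_mean": 12.61, col + "_nonretracted_mean": 14.48},
    "Mid":    {col + "_retracted_mean": 14.63, col + "_nonretracted_mean": 15.86},
    "Senior": {col + "_retracted_mean": 14.7,  col + "_nonretracted_mean": 16.44},
}
row = latex_row(demo, col, "mean")
print(row)
```

Writing the row terminator as `"\\\\ \n"` produces the two backslashes LaTeX expects without relying on an unrecognized escape sequence.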

In [138]:
for i in range(len(lst_dicts_retention)):
    dicto_retention = lst_dicts_retention[i]
    col = exp_fields[i]
    print(col)
    create_latex_for_filling(dicto_retention, col)

CollabAcademicAgeAtCollaboration
& 12.61 & 14.48 & 14.63 & 15.86 & 14.7 & 16.44\\ 

& 12.0 & 13.17 & 14.66 & 14.67 & 15.27 & 14.77\\ 

& 8.06 & 7.03 & 6.31 & 7.14 & 6.69 & 7.78\\ 

& 0.026 & 0.026 & 0.232 & 0.232 & 0.038 & 0.038\\ 

CollabMAGCumPapersAtCollaboration
& 68.22 & 80.12 & 76.44 & 78.95 & 61.64 & 65.82\\ 

& 44.92 & 61.61 & 63.35 & 67.75 & 46.17 & 54.28\\ 

& 81.33 & 120.23 & 81.4 & 60.06 & 71.74 & 60.07\\ 

& 0.298 & 0.298 & 0.819 & 0.819 & 0.582 & 0.582\\ 

CollabMAGCumCitationsAtCollaboration
& 1375.15 & 1356.35 & 1451.05 & 1530.51 & 1600.02 & 1242.4\\ 

& 405.5 & 690.48 & 650.21 & 846.9 & 551.75 & 496.91\\ 

& 2403.44 & 2306.96 & 2106.28 & 2496.01 & 6538.35 & 1802.62\\ 

& 0.943 & 0.943 & 0.822 & 0.822 & 0.517 & 0.517\\ 

CollabMAGCumCollaboratorsAtCollaboration
& 145.93 & 203.74 & 179.87 & 179.74 & 167.2 & 143.86\\ 

& 77.5 & 100.2 & 119.62 & 121.23 & 83.56 & 104.25\\ 

& 234.2 & 486.63 & 176.47 & 181.8 & 220.15 & 144.35\\ 

& 0.174 & 0.174 & 0.996 & 0.996 & 0.275 & 0.2

In [139]:
for i in range(len(lst_dicts_gain)):
    dicto_gain = lst_dicts_gain[i]
    col = exp_fields[i]
    print(col)
    create_latex_for_filling(dicto_gain, col)

CollabAcademicAgeAtCollaboration
& 7.07 & 7.73 & 8.15 & 7.81 & 8.26 & 8.47\\ 

& 5.14 & 7.5 & 8.0 & 7.32 & 7.86 & 7.74\\ 

& 7.56 & 5.34 & 5.08 & 4.73 & 4.73 & 4.73\\ 

& 0.355 & 0.355 & 0.63 & 0.63 & 0.712 & 0.712\\ 

CollabMAGCumPapersAtCollaboration
& 34.08 & 36.42 & 40.45 & 32.42 & 35.27 & 34.86\\ 

& 17.75 & 29.4 & 30.53 & 28.15 & 29.43 & 29.84\\ 

& 45.04 & 32.08 & 33.89 & 25.92 & 27.02 & 28.91\\ 

& 0.583 & 0.583 & 0.066 & 0.066 & 0.902 & 0.902\\ 

CollabMAGCumCitationsAtCollaboration
& 826.74 & 855.75 & 779.37 & 585.2 & 759.34 & 712.43\\ 

& 147.57 & 386.71 & 408.0 & 303.38 & 493.76 & 361.32\\ 

& 1988.28 & 1426.11 & 942.5 & 798.07 & 869.04 & 1313.15\\ 

& 0.878 & 0.878 & 0.123 & 0.123 & 0.72 & 0.72\\ 

CollabMAGCumCollaboratorsAtCollaboration
& 95.85 & 110.23 & 133.36 & 92.45 & 157.23 & 109.44\\ 

& 38.08 & 63.97 & 74.81 & 57.6 & 72.32 & 72.5\\ 

& 155.45 & 144.51 & 180.11 & 154.97 & 295.06 & 139.22\\ 

& 0.379 & 0.379 & 0.092 & 0.092 & 0.079 & 0.079\\ 



In [140]:
dicto_retention

{'Junior': {'CollabMAGCumCollaboratorsAtCollaboration_retracted_mean': 145.93,
  'CollabMAGCumCollaboratorsAtCollaboration_retracted_median': 77.5,
  'CollabMAGCumCollaboratorsAtCollaboration_retracted_std': 234.2,
  'CollabMAGCumCollaboratorsAtCollaboration_nonretracted_mean': 203.74,
  'CollabMAGCumCollaboratorsAtCollaboration_nonretracted_median': 100.2,
  'CollabMAGCumCollaboratorsAtCollaboration_nonretracted_std': 486.63,
  'CollabMAGCumCollaboratorsAtCollaboration_delta_mean': -57.81,
  'CollabMAGCumCollaboratorsAtCollaboration_pval_welch': 0.174,
  'CollabMAGCumCollaboratorsAtCollaboration_CI_95lower': -143.7,
  'CollabMAGCumCollaboratorsAtCollaboration_CI_95upper': 28.08},
 'Mid': {'CollabMAGCumCollaboratorsAtCollaboration_retracted_mean': 179.87,
  'CollabMAGCumCollaboratorsAtCollaboration_retracted_median': 119.62,
  'CollabMAGCumCollaboratorsAtCollaboration_retracted_std': 176.47,
  'CollabMAGCumCollaboratorsAtCollaboration_nonretracted_mean': 179.74,
  'CollabMAGCumCollabor

In [141]:
dicto_gain

{'Junior': {'CollabMAGCumCollaboratorsAtCollaboration_retracted_mean': 95.85,
  'CollabMAGCumCollaboratorsAtCollaboration_retracted_median': 38.08,
  'CollabMAGCumCollaboratorsAtCollaboration_retracted_std': 155.45,
  'CollabMAGCumCollaboratorsAtCollaboration_nonretracted_mean': 110.23,
  'CollabMAGCumCollaboratorsAtCollaboration_nonretracted_median': 63.97,
  'CollabMAGCumCollaboratorsAtCollaboration_nonretracted_std': 144.51,
  'CollabMAGCumCollaboratorsAtCollaboration_delta_mean': -14.38,
  'CollabMAGCumCollaboratorsAtCollaboration_pval_welch': 0.379,
  'CollabMAGCumCollaboratorsAtCollaboration_CI_95lower': -45.05,
  'CollabMAGCumCollaboratorsAtCollaboration_CI_95upper': 16.3},
 'Mid': {'CollabMAGCumCollaboratorsAtCollaboration_retracted_mean': 133.36,
  'CollabMAGCumCollaboratorsAtCollaboration_retracted_median': 74.81,
  'CollabMAGCumCollaboratorsAtCollaboration_retracted_std': 180.11,
  'CollabMAGCumCollaboratorsAtCollaboration_nonretracted_mean': 92.45,
  'CollabMAGCumCollaborat

### A3: Collaborators retained vs lost: retracted vs. matched
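
Before building the A3 tables, it may help to spell out the difference-in-differences arithmetic described at the top of this notebook: first the retracted-minus-matched difference for retained and for lost collaborators, then RETAINED-LOST. The sketch below uses made-up group means for a single experience variable.

```python
# Hypothetical group means of one experience variable (numbers are made up).
means = {
    ("retained", "retracted"): 14.0,
    ("retained", "matched"):   15.5,
    ("lost", "retracted"):     10.0,
    ("lost", "matched"):       10.5,
}

# First differences: retracted minus matched, within each collaborator type.
diff_retained = means[("retained", "retracted")] - means[("retained", "matched")]
diff_lost = means[("lost", "retracted")] - means[("lost", "matched")]

# Difference in differences: RETAINED - LOST.
did = diff_retained - diff_lost
print(diff_retained, diff_lost, did)
```

A negative value here would mean that the retracted-vs-matched gap is more negative among retained collaborators than among lost ones.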

In [142]:
# Let us now split df_A into post-retraction and pre-retraction collaborations

df_A3_post = df_A[df_A['PrePostFlag5']=='post5']
df_A3_pre = df_A[df_A['PrePostFlag5']=='pre']

In [143]:
# Now we shall group by MAGAID, MAGCollabAID, and RetractionYear,
# and extract the earliest post-retraction collaboration year for each group

df_A3_firstcollabs = df_A3_post.groupby(['MAGAID','MAGCollabAID','RetractionYear'])['MAGCollaborationYear']\
                        .min().reset_index()\
                        .rename(columns={'MAGCollaborationYear':'FirstPostRetractionMAGCollaborationYear'})


# Now we shall merge the new column back into the post-retraction rows of A3

df_A3_w_firstcollabs = df_A3_post.merge(df_A3_firstcollabs,
                                   on=['MAGAID','MAGCollabAID','RetractionYear'])

df_A3_w_firstcollabs.shape


(83316, 37)

In [144]:
# Sanity checks

df_A3_w_firstcollabs.sort_values(by=['MAGAID','MAGCollabAID','MAGCollaborationYear'])\
            [['MAGAID','MAGCollabAID','MAGCollaborationYear','FirstPostRetractionMAGCollaborationYear']].head(30)

Unnamed: 0,MAGAID,MAGCollabAID,MAGCollaborationYear,FirstPostRetractionMAGCollaborationYear
3472,19100288.0,18011520,2004.0,2004.0
3479,19100288.0,121410733,2004.0,2004.0
3460,19100288.0,410625722,2005.0,2005.0
3473,19100288.0,1235268530,2004.0,2004.0
3482,19100288.0,1340583028,2006.0,2006.0
3454,19100288.0,1793107545,2003.0,2003.0
3455,19100288.0,1793107545,2004.0,2003.0
3474,19100288.0,1859744180,2004.0,2004.0
3456,19100288.0,1863203661,2003.0,2003.0
3457,19100288.0,1863203661,2004.0,2003.0


In [145]:
# Now let us only extract rows where collaboration year is the first collaboration year

df_A3_w_firstcollabs_only = df_A3_w_firstcollabs[df_A3_w_firstcollabs.MAGCollaborationYear == \
                                                df_A3_w_firstcollabs.FirstPostRetractionMAGCollaborationYear]

df_A3_w_firstcollabs_only.shape

(63167, 37)

In [146]:
df_A3_w_firstcollabs_only.columns

Index(['MAGAID', 'MAGCollabAID', 'RetractionYear', 'MAGCollaborationYear',
       'ScientistType', 'CollabGenderizeGender', 'CollabGenderizeConfidence',
       'CollabMAGFirstPubYear', 'CollabMAGCumPapersYearAtRetraction',
       'CollabMAGCumPapersAtRetraction',
       'CollabMAGCumCitationsYearAtRetraction',
       'CollabMAGCumCitationsAtRetraction',
       'CollabMAGCumCollaboratorsYearAtRetraction',
       'CollabMAGCumCollaboratorsAtRetraction',
       'CollabMAGCumPapersYearAtCollaboration',
       'CollabMAGCumPapersAtCollaboration',
       'CollabMAGCumCitationsYearAtCollaboration',
       'CollabMAGCumCitationsAtCollaboration',
       'CollabMAGCumCollaboratorsYearAtCollaboration',
       'CollabMAGCumCollaboratorsAtCollaboration',
       'MAGAIDFirstORLastAuthorFlag', 'AcademicAgeAtRetraction',
       'AuthorSeniorityAtRetraction', 'CollabAcademicAgeAtRetraction',
       'CollabAcademicAgeAtCollaboration', 'PrePostFlag5', 'post5', 'pre',
       'NumRetentionW5', 'CollabAIDRe

In [147]:
# Finally let us stack the post (first collaborations only) and pre rows

df_A3_post_pre = pd.concat([df_A3_w_firstcollabs_only,df_A3_pre])

df_A3_post_pre.head()

Unnamed: 0,MAGAID,MAGCollabAID,RetractionYear,MAGCollaborationYear,ScientistType,CollabGenderizeGender,CollabGenderizeConfidence,CollabMAGFirstPubYear,CollabMAGCumPapersYearAtRetraction,CollabMAGCumPapersAtRetraction,...,pre,NumRetentionW5,CollabAIDRetainedW5,CollabAIDLostW5,NumNewCollaboratorsW5,CollabAIDGainedW5,CollabAIDinRetained,CollabAIDinGained,CollabAIDinLost,FirstPostRetractionMAGCollaborationYear
2,2033335000.0,1917877966,1995.0,1998.0,retracted,male,0.99,1994.0,1994.0,1.0,...,"{2041262087, 2590561305, 1940968476, 211428099...",23,"{2690057605, 2041262087, 2466141063, 230478619...","{2590561305, 1940968476, 2657016352, 265692164...",60,"{1973921793, 2308182914, 1147890307, 265602649...",False,True,False,1998.0
4,2033335000.0,2169118091,1995.0,1998.0,retracted,male,1.0,1995.0,1995.0,1.0,...,"{2041262087, 2590561305, 1940968476, 211428099...",23,"{2690057605, 2041262087, 2466141063, 230478619...","{2590561305, 1940968476, 2657016352, 265692164...",60,"{1973921793, 2308182914, 1147890307, 265602649...",False,True,False,1998.0
7,2033335000.0,275085591,1995.0,1996.0,retracted,male,0.99,1994.0,1995.0,5.0,...,"{2041262087, 2590561305, 1940968476, 211428099...",23,"{2690057605, 2041262087, 2466141063, 230478619...","{2590561305, 1940968476, 2657016352, 265692164...",60,"{1973921793, 2308182914, 1147890307, 265602649...",True,False,False,1996.0
9,2033335000.0,2111014462,1995.0,1996.0,retracted,female,0.98,1988.0,1995.0,17.0,...,"{2041262087, 2590561305, 1940968476, 211428099...",23,"{2690057605, 2041262087, 2466141063, 230478619...","{2590561305, 1940968476, 2657016352, 265692164...",60,"{1973921793, 2308182914, 1147890307, 265602649...",True,False,False,1996.0
12,2033335000.0,2622920657,1995.0,1996.0,retracted,male,0.99,1991.0,1995.0,17.0,...,"{2041262087, 2590561305, 1940968476, 211428099...",23,"{2690057605, 2041262087, 2466141063, 230478619...","{2590561305, 1940968476, 2657016352, 265692164...",60,"{1973921793, 2308182914, 1147890307, 265602649...",True,False,False,1996.0


In [148]:
def create_stratified_dfs_a3(dfi):
    
    # This function creates the 12 dataframes needed for the analysis:
    # retained and lost collaborators, for retracted (treatment) and
    # matched (control) scientists, each further stratified by seniority
    # (early-career, mid-career, senior).
    df_ids = pd.read_csv(INDIR_MATCHING+"/RWMAG_rematched_control_augmented_rematching_30perc.csv",
                    usecols=['MAGAID','MatchMAGAID', 'RetractionYear']).drop_duplicates()
    
    rel_cols = ['MAGAID', 'ScientistType','MAGCollabAID', 'RetractionYear',
               'CollabMAGCumPapersAtRetraction', 'CollabMAGCumCitationsAtRetraction',
               'CollabMAGCumCollaboratorsAtRetraction', 'AuthorSeniorityAtRetraction',
               'CollabAcademicAgeAtRetraction', 'CollabAIDinRetained', 'CollabAIDinLost']
    
    # Only extracting relevant cols
    dfi = dfi[rel_cols].drop_duplicates()
    
    # Extract the collaborators that were retained and those that were lost
    dfi_retained = dfi[dfi['CollabAIDinRetained']]
    dfi_lost = dfi[dfi['CollabAIDinLost']]
    
    # Dividing into retracted and matched
    df_retracted_retained = dfi_retained[dfi_retained.ScientistType == 'retracted']
    df_retracted_lost = dfi_lost[dfi_lost.ScientistType == 'retracted']
    
    df_nonretracted_retained = dfi_retained[dfi_retained.ScientistType == 'matched']\
                        .rename(columns={'MAGAID':'MatchMAGAID'})\
                        .merge(df_ids, on=['MatchMAGAID','RetractionYear'])
    
    df_nonretracted_lost = dfi_lost[dfi_lost.ScientistType == 'matched']\
                        .rename(columns={'MAGAID':'MatchMAGAID'})\
                        .merge(df_ids, on=['MatchMAGAID','RetractionYear'])
    
    # We also need to make sure that the four groups (retracted/non-retracted x retained/lost) share the same ids
    
    set1 = set(df_retracted_retained['MAGAID'].unique())
    set2 = set(df_retracted_lost['MAGAID'].unique())
    set3 = set(df_nonretracted_retained['MAGAID'].unique())
    set4 = set(df_nonretracted_lost['MAGAID'].unique())
    
    magaids_intersection = set1.intersection(set2, set3, set4)
    
    df_retracted_retained = df_retracted_retained[df_retracted_retained.MAGAID.isin(magaids_intersection)]
    df_retracted_lost = df_retracted_lost[df_retracted_lost.MAGAID.isin(magaids_intersection)]
    df_nonretracted_retained = df_nonretracted_retained[df_nonretracted_retained.MAGAID.isin(magaids_intersection)]
    df_nonretracted_lost = df_nonretracted_lost[df_nonretracted_lost.MAGAID.isin(magaids_intersection)]

    
    # Dividing into seniority for retracted retained
    
    dfrj_r = df_retracted_retained[df_retracted_retained.AuthorSeniorityAtRetraction=='early-career author']
    dfrm_r = df_retracted_retained[df_retracted_retained.AuthorSeniorityAtRetraction=='mid-career author']
    dfrs_r = df_retracted_retained[df_retracted_retained.AuthorSeniorityAtRetraction=='senior author']
    
    dfrj_l = df_retracted_lost[df_retracted_lost.AuthorSeniorityAtRetraction=='early-career author']
    dfrm_l = df_retracted_lost[df_retracted_lost.AuthorSeniorityAtRetraction=='mid-career author']
    dfrs_l = df_retracted_lost[df_retracted_lost.AuthorSeniorityAtRetraction=='senior author']
    
    # and matched
    dfnrj_r = df_nonretracted_retained[df_nonretracted_retained.AuthorSeniorityAtRetraction=='early-career author']
    dfnrm_r = df_nonretracted_retained[df_nonretracted_retained.AuthorSeniorityAtRetraction=='mid-career author']
    dfnrs_r = df_nonretracted_retained[df_nonretracted_retained.AuthorSeniorityAtRetraction=='senior author']
    
    dfnrj_l = df_nonretracted_lost[df_nonretracted_lost.AuthorSeniorityAtRetraction=='early-career author']
    dfnrm_l = df_nonretracted_lost[df_nonretracted_lost.AuthorSeniorityAtRetraction=='mid-career author']
    dfnrs_l = df_nonretracted_lost[df_nonretracted_lost.AuthorSeniorityAtRetraction=='senior author']
    
    return [dfrj_r,dfrm_r,dfrs_r,dfrj_l,dfrm_l,dfrs_l,dfnrj_r,dfnrm_r,dfnrs_r,dfnrj_l,dfnrm_l,dfnrs_l]
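
The ID-alignment step in `create_stratified_dfs_a3` can be illustrated on toy data. This is a minimal sketch, not the notebook's code: the frames and the helper `restrict_to_common_ids` are hypothetical, and only the `MAGAID` column name comes from the function above.

```python
import pandas as pd

def restrict_to_common_ids(frames, id_col="MAGAID"):
    """Keep only rows whose id appears in every frame (hypothetical helper)."""
    common = set.intersection(*(set(f[id_col].unique().tolist()) for f in frames))
    return [f[f[id_col].isin(common)] for f in frames], common

# Toy stand-ins for the four groups (retracted/matched x retained/lost).
toy_frames = [
    pd.DataFrame({"MAGAID": [1, 1, 2, 3]}),
    pd.DataFrame({"MAGAID": [1, 2, 2]}),
    pd.DataFrame({"MAGAID": [2, 3, 1]}),
    pd.DataFrame({"MAGAID": [1, 2, 4]}),
]

filtered, common_ids = restrict_to_common_ids(toy_frames)
print(sorted(common_ids))
print([f["MAGAID"].nunique() for f in filtered])
```

After filtering, every group reports the same number of unique ids, which is the property the printed counts in the next cell confirm for the real data.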
    

In [149]:
lst_stratified_dfs = create_stratified_dfs_a3(df_A3_post_pre)

for dfj in lst_stratified_dfs:
    print(dfj.MAGAID.nunique())
    
dfrj_r,dfrm_r,dfrs_r,dfrj_l,dfrm_l,dfrs_l,dfnrj_r,dfnrm_r,dfnrs_r,dfnrj_l,dfnrm_l,dfnrs_l = lst_stratified_dfs

146
85
152
146
85
152
146
85
152
146
85
152


In [166]:
dfrj_r

Unnamed: 0,MAGAID,ScientistType,MAGCollabAID,RetractionYear,CollabMAGCumPapersAtRetraction,CollabMAGCumCitationsAtRetraction,CollabMAGCumCollaboratorsAtRetraction,AuthorSeniorityAtRetraction,CollabAcademicAgeAtRetraction,CollabAIDinRetained,CollabAIDinLost
5012,2.703327e+09,retracted,1974151439,1999.0,110.0,4437.0,170.0,early-career author,26.0,True,False
5016,2.703327e+09,retracted,2307543124,1999.0,1.0,5.0,3.0,early-career author,2.0,True,False
6698,2.098044e+09,retracted,2104946485,2007.0,4.0,29.0,8.0,early-career author,5.0,True,False
6699,2.098044e+09,retracted,2131257641,2007.0,152.0,1344.0,180.0,early-career author,29.0,True,False
6700,2.098044e+09,retracted,2145415599,2007.0,51.0,723.0,61.0,early-career author,33.0,True,False
...,...,...,...,...,...,...,...,...,...,...,...
52252,2.114194e+09,retracted,2612188761,2015.0,4.0,9.0,17.0,early-career author,1.0,True,False
52257,2.114194e+09,retracted,2693916240,2015.0,2.0,3.0,13.0,early-career author,1.0,True,False
52420,2.569982e+09,retracted,2113686342,2007.0,38.0,4192.0,109.0,early-career author,36.0,True,False
52424,2.569982e+09,retracted,2566957369,2007.0,106.0,14090.0,250.0,early-career author,33.0,True,False
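
Each row above is one (scientist, collaborator) pair, so the next cell averages within scientist before any comparison. For matched scientists the averaging happens in two stages: first per match, then across the matches of each retracted scientist, which gives every match equal weight regardless of how many collaborators it has. A minimal sketch on invented numbers (the column `Papers` is a stand-in, not a column of the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "MAGAID":      [10, 10, 10, 10],
    "MatchMAGAID": [71, 71, 71, 72],
    "Papers":      [1.0, 2.0, 3.0, 10.0],  # stand-in experience column
})

# Stage 1: mean over the collaborators of each match.
per_match = toy.groupby(["MAGAID", "MatchMAGAID"])[["Papers"]].mean()
# Stage 2: mean over the matches of each retracted scientist.
two_stage = per_match.groupby("MAGAID").mean()
# For contrast: pooling all collaborators in a single step.
pooled = toy.groupby("MAGAID")[["Papers"]].mean()

print(two_stage.loc[10, "Papers"], pooled.loc[10, "Papers"])
```

The two numbers differ whenever matches have unequal numbers of collaborators, which is why the order of aggregation matters.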


In [150]:
# Let us extract the mean dataframes and merge them for different age categories

def get_mean_df_a3(dfr, dfnr):
    mean_dfr = dfr.groupby('MAGAID')[['CollabAcademicAgeAtRetraction',
                      'CollabMAGCumPapersAtRetraction',
                      'CollabMAGCumCitationsAtRetraction',
                      'CollabMAGCumCollaboratorsAtRetraction']].mean()
    
    mean_dfnr = dfnr.groupby(['MAGAID','MatchMAGAID'])[['CollabAcademicAgeAtRetraction',
                      'CollabMAGCumPapersAtRetraction',
                      'CollabMAGCumCitationsAtRetraction',
                      'CollabMAGCumCollaboratorsAtRetraction']].mean()\
                    .groupby('MAGAID').mean()\
                    .rename(columns={'CollabMAGCumPapersAtRetraction':'MatchCollabMAGCumPapersAtRetraction',
                                    'CollabAcademicAgeAtRetraction':'MatchCollabAcademicAgeAtRetraction',
                                    'CollabMAGCumCitationsAtRetraction': 'MatchCollabMAGCumCitationsAtRetraction',
                                    'CollabMAGCumCollaboratorsAtRetraction': 'MatchCollabMAGCumCollaboratorsAtRetraction'})
    
    return mean_dfr, mean_dfnr

def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), stats.sem(a)
    h = se * stats.t.ppf((1 + confidence) / 2., n-1)
    return m, m-h, m+h

def get_stats(dfr, dfnr, column):
    """
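    Compute the mean, median, standard deviation, Welch's t-test p-value, and the
    95% CI of the mean retracted-minus-matched difference for the given column.
    dfr and dfnr must share the same index, because the difference is aligned on it.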
    """
    
    mean_r = dfr[column].mean()
    median_r = dfr[column].median()
    std_r = dfr[column].std()
    
    mean_nr = dfnr['Match'+column].mean()
    median_nr = dfnr['Match'+column].median()
    std_nr = dfnr['Match'+column].std()
    
    _, pval = stats.ttest_ind(dfr[column],dfnr['Match'+column], equal_var=False)
    
    lst_delta = (dfr[column]- dfnr['Match'+column]).tolist()
    delta_mean, conf_lower, conf_upper = mean_confidence_interval(lst_delta, confidence=0.95)
    
    return {column+'_retracted_mean':round(mean_r,2), 
            column+'_retracted_median':round(median_r,2), 
            column+'_retracted_std':round(std_r,2), 
            column+'_nonretracted_mean':round(mean_nr,2), 
            column+'_nonretracted_median':round(median_nr,2), 
            column+'_nonretracted_std':round(std_nr,2), 
            column+'_delta_mean':round(delta_mean,2), 
            column+'_pval_welch':round(pval,3), 
            column+'_CI_95lower':round(conf_lower,2), 
            column+'_CI_95upper':round(conf_upper,2)}
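
`get_stats` relies on `stats.ttest_ind(..., equal_var=False)`, i.e. Welch's unequal-variance t-test. As an illustration of what that option computes, the sketch below derives the Welch t statistic and the Welch-Satterthwaite degrees of freedom with NumPy alone. The function name `welch_t_and_df` and the sample values are made up for the example, and the p-value step (a two-sided tail of the t distribution) is left to SciPy as in the cell above.

```python
import numpy as np

def welch_t_and_df(a, b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom (illustrative)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Squared standard errors of the two sample means.
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

retracted = [3.0, 5.0, 7.0, 9.0]      # invented per-scientist means
matched = [4.0, 4.5, 5.0, 5.5, 6.0]   # invented per-scientist means
t_stat, dof = welch_t_and_df(retracted, matched)
print(round(t_stat, 3), round(dof, 2))
```

The degrees of freedom fall between the smaller sample's `n - 1` and the pooled `n1 + n2 - 2`, moving toward the lower end as the variances become more unequal.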

In [151]:
# Now let us do the comparison

# Let us first get the mean dataframes

# for junior
mean_dfrj_r, mean_dfnrj_r = get_mean_df_a3(dfrj_r, dfnrj_r) # for retained
mean_dfrj_l, mean_dfnrj_l = get_mean_df_a3(dfrj_l, dfnrj_l) # for lost

# for mid-rank
mean_dfrm_r, mean_dfnrm_r = get_mean_df_a3(dfrm_r, dfnrm_r) 
mean_dfrm_l, mean_dfnrm_l = get_mean_df_a3(dfrm_l, dfnrm_l)

# for senior
mean_dfrs_r, mean_dfnrs_r = get_mean_df_a3(dfrs_r, dfnrs_r)
mean_dfrs_l, mean_dfnrs_l = get_mean_df_a3(dfrs_l, dfnrs_l)


# Now let us compute differences

def compute_diff_df(df_ri, df_li, scientistType='retracted'):
    
    dfrli = df_ri.merge(df_li, right_index=True, left_index=True)
    
    if scientistType == 'matched':
        
        dfrli['MatchDiffAcademicAgeAtRetraction'] = dfrli['MatchCollabAcademicAgeAtRetraction_x'] - \
                                                dfrli['MatchCollabAcademicAgeAtRetraction_y']
        
        dfrli['MatchDiffMAGCumPapersAtRetraction'] = dfrli['MatchCollabMAGCumPapersAtRetraction_x'] - \
                                                dfrli['MatchCollabMAGCumPapersAtRetraction_y']
        
        dfrli['MatchDiffMAGCumCitationsAtRetraction'] = dfrli['MatchCollabMAGCumCitationsAtRetraction_x'] - \
                                                dfrli['MatchCollabMAGCumCitationsAtRetraction_y']
        
        dfrli['MatchDiffMAGCumCollaboratorsAtRetraction'] = dfrli['MatchCollabMAGCumCollaboratorsAtRetraction_x'] - \
                                                dfrli['MatchCollabMAGCumCollaboratorsAtRetraction_y']
        
        return dfrli
    
        
    dfrli['DiffAcademicAgeAtRetraction'] = dfrli['CollabAcademicAgeAtRetraction_x'] - \
                                            dfrli['CollabAcademicAgeAtRetraction_y']

    dfrli['DiffMAGCumPapersAtRetraction'] = dfrli['CollabMAGCumPapersAtRetraction_x'] - \
                                            dfrli['CollabMAGCumPapersAtRetraction_y']

    dfrli['DiffMAGCumCitationsAtRetraction'] = dfrli['CollabMAGCumCitationsAtRetraction_x'] - \
                                            dfrli['CollabMAGCumCitationsAtRetraction_y']

    dfrli['DiffMAGCumCollaboratorsAtRetraction'] = dfrli['CollabMAGCumCollaboratorsAtRetraction_x'] - \
                                            dfrli['CollabMAGCumCollaboratorsAtRetraction_y']

    return dfrli
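
The retained-minus-lost step can be reproduced on toy data. The sketch below is a variant under stated assumptions rather than the function above: it names the merge suffixes explicitly (`_retained`, `_lost`) instead of relying on the default `_x`/`_y`, and the helper `retained_minus_lost` and its values are hypothetical.

```python
import pandas as pd

def retained_minus_lost(df_retained, df_lost, columns):
    """Inner-join per-scientist means on the index and subtract lost from retained."""
    merged = df_retained.merge(df_lost, left_index=True, right_index=True,
                               suffixes=("_retained", "_lost"))
    for col in columns:
        merged["Diff" + col] = merged[col + "_retained"] - merged[col + "_lost"]
    return merged

idx = pd.Index([101, 102], name="MAGAID")
toy_retained = pd.DataFrame({"Papers": [40.0, 16.5]}, index=idx)
toy_lost = pd.DataFrame({"Papers": [9.0, 5.5]}, index=idx)

diff = retained_minus_lost(toy_retained, toy_lost, ["Papers"])
print(diff)
```

Looping over a list of column names would also remove the need for a separate `'matched'` branch, since the `Match` prefix could simply be part of the names passed in.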
    


In [152]:
dfrj_rMinusl = compute_diff_df(mean_dfrj_r, mean_dfrj_l)
dfnrj_rMinusl = compute_diff_df(mean_dfnrj_r, mean_dfnrj_l, scientistType='matched')


dfrm_rMinusl = compute_diff_df(mean_dfrm_r, mean_dfrm_l)
dfnrm_rMinusl = compute_diff_df(mean_dfnrm_r, mean_dfnrm_l, scientistType='matched')

dfrs_rMinusl = compute_diff_df(mean_dfrs_r, mean_dfrs_l)
dfnrs_rMinusl = compute_diff_df(mean_dfnrs_r, mean_dfnrs_l, scientistType='matched')

In [169]:
dfrj_rMinusl

MAGAID,CollabAcademicAgeAtRetraction_x,CollabMAGCumPapersAtRetraction_x,CollabMAGCumCitationsAtRetraction_x,CollabMAGCumCollaboratorsAtRetraction_x,CollabAcademicAgeAtRetraction_y,CollabMAGCumPapersAtRetraction_y,CollabMAGCumCitationsAtRetraction_y,CollabMAGCumCollaboratorsAtRetraction_y,DiffAcademicAgeAtRetraction,DiffMAGCumPapersAtRetraction,DiffMAGCumCitationsAtRetraction,DiffMAGCumCollaboratorsAtRetraction
3.343381e+07,12.500000,39.250000,242.250000,23.000000,4.333333,9.333333,32.666667,14.000000,8.166667,29.916667,209.583333,9.000
9.428704e+07,13.000000,16.500000,161.500000,15.500000,3.300000,5.200000,14.200000,9.800000,9.700000,11.300000,147.300000,5.700
1.839367e+08,17.375000,33.500000,882.375000,67.500000,16.000000,24.625000,247.250000,105.375000,1.375000,8.875000,635.125000,-37.875
2.066031e+08,6.666667,37.416667,245.250000,252.000000,4.200000,29.000000,121.600000,312.200000,2.466667,8.416667,123.650000,-60.200
3.474066e+08,13.000000,40.000000,738.000000,40.000000,19.750000,57.000000,971.250000,49.750000,-6.750000,-17.000000,-233.250000,-9.750
...,...,...,...,...,...,...,...,...,...,...,...,...
2.955031e+09,13.560000,74.680000,1704.280000,314.520000,12.538462,43.769231,743.384615,218.000000,1.021538,30.910769,960.895385,96.520
2.987935e+09,1.666667,2.000000,3.666667,8.333333,0.666667,2.000000,0.666667,5.333333,1.000000,0.000000,3.000000,3.000
2.996121e+09,11.800000,30.600000,63.600000,82.800000,3.300000,56.600000,617.300000,126.100000,8.500000,-26.000000,-553.700000,-43.300
3.052744e+09,3.200000,4.800000,5.800000,8.800000,18.000000,113.000000,537.500000,48.000000,-14.800000,-108.200000,-531.700000,-39.200


In [70]:
exp_fields = ['DiffAcademicAgeAtRetraction',
              'DiffMAGCumPapersAtRetraction',
              'DiffMAGCumCitationsAtRetraction',
              'DiffMAGCumCollaboratorsAtRetraction']

# Now we compute the outcome variables for each of the four experience variables.

lst_dicts_a3 = []

# For age, papers, cites, collabs
for exp_field in exp_fields:
    dicts_a3 = {}
    # we compute the stats for junior, mid, and senior
    dict_stats_j = get_stats(dfrj_rMinusl, dfnrj_rMinusl, exp_field)
    dict_stats_m = get_stats(dfrm_rMinusl, dfnrm_rMinusl, exp_field)
    dict_stats_s = get_stats(dfrs_rMinusl, dfnrs_rMinusl, exp_field)
    
    dicts_a3['Junior'] = dict_stats_j
    dicts_a3['Mid'] = dict_stats_m
    dicts_a3['Senior'] = dict_stats_s
    
    lst_dicts_a3.append(dicts_a3)

In [71]:
pd.DataFrame(lst_dicts_a3[0])

Unnamed: 0,Junior,Mid,Senior
DiffAcademicAgeAtRetraction_retracted_mean,3.05,3.19,-0.73
DiffAcademicAgeAtRetraction_retracted_median,2.47,2.73,-0.01
DiffAcademicAgeAtRetraction_retracted_std,9.78,6.03,5.81
DiffAcademicAgeAtRetraction_nonretracted_mean,3.89,4.08,0.54
DiffAcademicAgeAtRetraction_nonretracted_median,3.18,3.54,0.06
DiffAcademicAgeAtRetraction_nonretracted_std,6.83,6.59,7.08
DiffAcademicAgeAtRetraction_delta_mean,-0.84,-0.89,-1.28
DiffAcademicAgeAtRetraction_pval_welch,0.397,0.362,0.087
DiffAcademicAgeAtRetraction_CI_95lower,-2.68,-2.39,-2.78
DiffAcademicAgeAtRetraction_CI_95upper,1.0,0.62,0.22
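
Reading the table: `delta_mean` is the difference in differences, i.e. the retracted retained-minus-lost mean minus the matched one. Because every entry is rounded to two decimals independently, the printed means reproduce the printed delta only up to rounding, as this small check on the values above shows:

```python
# Values copied from the table above: (retracted_mean, nonretracted_mean, delta_mean).
rows = {
    "Junior": (3.05, 3.89, -0.84),
    "Mid": (3.19, 4.08, -0.89),
    "Senior": (-0.73, 0.54, -1.28),
}

checks = {}
for group, (mean_r, mean_nr, delta) in rows.items():
    recomputed = round(mean_r - mean_nr, 2)
    # Each printed number carries up to 0.005 rounding error, so allow 0.015 in total.
    checks[group] = abs(recomputed - delta) <= 0.015 + 1e-9
    print(group, recomputed, delta, checks[group])
```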


In [72]:
pd.DataFrame(lst_dicts_a3[1])

Unnamed: 0,Junior,Mid,Senior
DiffMAGCumPapersAtRetraction_retracted_mean,26.23,22.86,13.25
DiffMAGCumPapersAtRetraction_retracted_median,9.39,9.19,6.02
DiffMAGCumPapersAtRetraction_retracted_std,71.83,68.39,47.8
DiffMAGCumPapersAtRetraction_nonretracted_mean,28.85,29.0,19.17
DiffMAGCumPapersAtRetraction_nonretracted_median,16.53,21.5,9.64
DiffMAGCumPapersAtRetraction_nonretracted_std,100.54,45.09,52.6
DiffMAGCumPapersAtRetraction_delta_mean,-2.62,-6.14,-5.92
DiffMAGCumPapersAtRetraction_pval_welch,0.798,0.491,0.305
DiffMAGCumPapersAtRetraction_CI_95lower,-23.02,-24.29,-18.0
DiffMAGCumPapersAtRetraction_CI_95upper,17.78,12.01,6.16


In [73]:
pd.DataFrame(lst_dicts_a3[2])

Unnamed: 0,Junior,Mid,Senior
DiffMAGCumCitationsAtRetraction_retracted_mean,439.08,33.32,238.65
DiffMAGCumCitationsAtRetraction_retracted_median,60.42,-4.98,-0.76
DiffMAGCumCitationsAtRetraction_retracted_std,1988.35,1621.72,3374.01
DiffMAGCumCitationsAtRetraction_nonretracted_mean,332.56,423.13,247.34
DiffMAGCumCitationsAtRetraction_nonretracted_median,115.26,118.41,-20.15
DiffMAGCumCitationsAtRetraction_nonretracted_std,1576.83,1646.59,1370.7
DiffMAGCumCitationsAtRetraction_delta_mean,106.53,-389.81,-8.69
DiffMAGCumCitationsAtRetraction_pval_welch,0.612,0.122,0.977
DiffMAGCumCitationsAtRetraction_CI_95lower,-299.5,-948.01,-603.62
DiffMAGCumCitationsAtRetraction_CI_95upper,512.56,168.4,586.24


In [74]:
pd.DataFrame(lst_dicts_a3[3])

Unnamed: 0,Junior,Mid,Senior
DiffMAGCumCollaboratorsAtRetraction_retracted_mean,35.47,11.1,20.3
DiffMAGCumCollaboratorsAtRetraction_retracted_median,13.65,15.0,3.73
DiffMAGCumCollaboratorsAtRetraction_retracted_std,227.24,136.11,147.36
DiffMAGCumCollaboratorsAtRetraction_nonretracted_mean,59.86,58.24,25.48
DiffMAGCumCollaboratorsAtRetraction_nonretracted_median,21.39,24.25,13.4
DiffMAGCumCollaboratorsAtRetraction_nonretracted_std,341.14,136.26,124.14
DiffMAGCumCollaboratorsAtRetraction_delta_mean,-24.4,-47.14,-5.18
DiffMAGCumCollaboratorsAtRetraction_pval_welch,0.473,0.025,0.741
DiffMAGCumCollaboratorsAtRetraction_CI_95lower,-94.69,-90.16,-37.52
DiffMAGCumCollaboratorsAtRetraction_CI_95upper,45.9,-4.12,27.16


In [75]:
def create_latex_for_filling(dicto, col):
    
    def create_string(metric):
        string = ""
        if metric == 'pval_welch':
            string = "& " + \
                str(dicto.get('Junior').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Junior').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Mid').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Mid').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Senior').get(col+"_"+metric)) + \
                " & " + \
                str(dicto.get('Senior').get(col+"_"+metric)) + \
                "\\\\ \n"
        else:
            string = "& " + \
                    str(dicto.get('Junior').get(col+"_retracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Junior').get(col+"_nonretracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Mid').get(col+"_retracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Mid').get(col+"_nonretracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Senior').get(col+"_retracted_"+metric)) + \
                    " & " + \
                    str(dicto.get('Senior').get(col+"_nonretracted_"+metric)) + \
                    "\\\\ \n"
        
        
        
        return string
    
    print(create_string("mean"))
    print(create_string("median"))
    print(create_string("std"))
    print(create_string("pval_welch"))



for i in range(len(lst_dicts_a3)):
    dicto_did = lst_dicts_a3[i]
    col = exp_fields[i]
    print(col)
    create_latex_for_filling(dicto_did, col)

DiffAcademicAgeAtRetraction
& 3.05 & 3.89 & 3.19 & 4.08 & -0.73 & 0.54\\ 

& 2.47 & 3.18 & 2.73 & 3.54 & -0.01 & 0.06\\ 

& 9.78 & 6.83 & 6.03 & 6.59 & 5.81 & 7.08\\ 

& 0.397 & 0.397 & 0.362 & 0.362 & 0.087 & 0.087\\ 

DiffMAGCumPapersAtRetraction
& 26.23 & 28.85 & 22.86 & 29.0 & 13.25 & 19.17\\ 

& 9.39 & 16.53 & 9.19 & 21.5 & 6.02 & 9.64\\ 

& 71.83 & 100.54 & 68.39 & 45.09 & 47.8 & 52.6\\ 

& 0.798 & 0.798 & 0.491 & 0.491 & 0.305 & 0.305\\ 

DiffMAGCumCitationsAtRetraction
& 439.08 & 332.56 & 33.32 & 423.13 & 238.65 & 247.34\\ 

& 60.42 & 115.26 & -4.98 & 118.41 & -0.76 & -20.15\\ 

& 1988.35 & 1576.83 & 1621.72 & 1646.59 & 3374.01 & 1370.7\\ 

& 0.612 & 0.612 & 0.122 & 0.122 & 0.977 & 0.977\\ 

DiffMAGCumCollaboratorsAtRetraction
& 35.47 & 59.86 & 11.1 & 58.24 & 20.3 & 25.48\\ 

& 13.65 & 21.39 & 15.0 & 24.25 & 3.73 & 13.4\\ 

& 227.24 & 341.14 & 136.11 & 136.26 & 147.36 & 124.14\\ 

& 0.473 & 0.473 & 0.025 & 0.025 & 0.741 & 0.741\\ 
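
The row-building logic can be written more compactly. A sketch under the assumption that each statistics dictionary has the key layout shown in the tables above; `latex_row` is a hypothetical helper, not part of the notebook, and the toy values are invented:

```python
def latex_row(dicto, col, metric, groups=("Junior", "Mid", "Senior")):
    """Build one LaTeX table row: retracted and matched value for each age group."""
    cells = []
    for group in groups:
        stats_g = dicto[group]
        if metric == "pval_welch":
            # The p-value is shared by both columns of a group, so it is repeated.
            cells += [stats_g[f"{col}_{metric}"]] * 2
        else:
            cells += [stats_g[f"{col}_retracted_{metric}"],
                      stats_g[f"{col}_nonretracted_{metric}"]]
    return "& " + " & ".join(str(c) for c in cells) + r"\\"

toy = {g: {"X_retracted_mean": 1.0, "X_nonretracted_mean": 2.0, "X_pval_welch": 0.5}
       for g in ("Junior", "Mid", "Senior")}
print(latex_row(toy, "X", "mean"))
print(latex_row(toy, "X", "pval_welch"))
```

Using a raw string for the trailing `\\` avoids escape-sequence ambiguity in the row terminator.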



In [76]:
dicto_did

{'Junior': {'DiffMAGCumCollaboratorsAtRetraction_retracted_mean': 35.47,
  'DiffMAGCumCollaboratorsAtRetraction_retracted_median': 13.65,
  'DiffMAGCumCollaboratorsAtRetraction_retracted_std': 227.24,
  'DiffMAGCumCollaboratorsAtRetraction_nonretracted_mean': 59.86,
  'DiffMAGCumCollaboratorsAtRetraction_nonretracted_median': 21.39,
  'DiffMAGCumCollaboratorsAtRetraction_nonretracted_std': 341.14,
  'DiffMAGCumCollaboratorsAtRetraction_delta_mean': -24.4,
  'DiffMAGCumCollaboratorsAtRetraction_pval_welch': 0.473,
  'DiffMAGCumCollaboratorsAtRetraction_CI_95lower': -94.69,
  'DiffMAGCumCollaboratorsAtRetraction_CI_95upper': 45.9},
 'Mid': {'DiffMAGCumCollaboratorsAtRetraction_retracted_mean': 11.1,
  'DiffMAGCumCollaboratorsAtRetraction_retracted_median': 15.0,
  'DiffMAGCumCollaboratorsAtRetraction_retracted_std': 136.11,
  'DiffMAGCumCollaboratorsAtRetraction_nonretracted_mean': 58.24,
  'DiffMAGCumCollaboratorsAtRetraction_nonretracted_median': 24.25,
  'DiffMAGCumCollaboratorsAtRetr

# Preprocessing dictionaries for plots

In [77]:
expfield_categories = ['Academic Age','Number of Papers',
                       'Number of Citations', 'Number of Collaborators']

master_dict = {}

master_dict['Retention'] = {}

for i in range(len(expfield_categories)):
    master_dict['Retention'][expfield_categories[i]] = lst_dicts_retention[i]

master_dict['Gain'] = {}

for i in range(len(expfield_categories)):
    master_dict['Gain'][expfield_categories[i]] = lst_dicts_gain[i]
    
master_dict['DiD'] = {}

for i in range(len(expfield_categories)):
    master_dict['DiD'][expfield_categories[i]] = lst_dicts_a3[i]
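
The three loops above follow one pattern, so the same nested structure could be built with `zip` and a dict comprehension. A sketch with placeholder lists standing in for `lst_dicts_retention`, `lst_dicts_gain`, and `lst_dicts_a3` (the placeholder contents are invented):

```python
categories = ["Academic Age", "Number of Papers",
              "Number of Citations", "Number of Collaborators"]

# Placeholders: one entry per category, in the same order as `categories`.
toy_results = {
    "Retention": ["ret_age", "ret_papers", "ret_cites", "ret_collabs"],
    "Gain": ["gain_age", "gain_papers", "gain_cites", "gain_collabs"],
    "DiD": ["did_age", "did_papers", "did_cites", "did_collabs"],
}

toy_master = {
    analysis: dict(zip(categories, per_category))
    for analysis, per_category in toy_results.items()
}
print(toy_master["DiD"]["Number of Papers"])
```

Both versions depend on the result lists being ordered exactly like the category labels.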

In [78]:
master_dict.keys()

dict_keys(['Retention', 'Gain', 'DiD'])

In [79]:
master_dict

{'Retention': {'Academic Age': {'Junior': {'CollabAcademicAgeAtCollaboration_retracted_mean': 12.61,
    'CollabAcademicAgeAtCollaboration_retracted_median': 12.0,
    'CollabAcademicAgeAtCollaboration_retracted_std': 8.06,
    'CollabAcademicAgeAtCollaboration_nonretracted_mean': 14.48,
    'CollabAcademicAgeAtCollaboration_nonretracted_median': 13.17,
    'CollabAcademicAgeAtCollaboration_nonretracted_std': 7.03,
    'CollabAcademicAgeAtCollaboration_delta_mean': -1.87,
    'CollabAcademicAgeAtCollaboration_pval_welch': 0.026,
    'CollabAcademicAgeAtCollaboration_CI_95lower': -3.35,
    'CollabAcademicAgeAtCollaboration_CI_95upper': -0.4},
   'Mid': {'CollabAcademicAgeAtCollaboration_retracted_mean': 14.63,
    'CollabAcademicAgeAtCollaboration_retracted_median': 14.66,
    'CollabAcademicAgeAtCollaboration_retracted_std': 6.31,
    'CollabAcademicAgeAtCollaboration_nonretracted_mean': 15.86,
    'CollabAcademicAgeAtCollaboration_nonretracted_median': 14.67,
    'CollabAcademicAgeAt

In [80]:
pd.DataFrame.from_dict(master_dict)

Unnamed: 0,Retention,Gain,DiD
Academic Age,{'Junior': {'CollabAcademicAgeAtCollaboration_...,{'Junior': {'CollabAcademicAgeAtCollaboration_...,{'Junior': {'DiffAcademicAgeAtRetraction_retra...
Number of Papers,{'Junior': {'CollabMAGCumPapersAtCollaboration...,{'Junior': {'CollabMAGCumPapersAtCollaboration...,{'Junior': {'DiffMAGCumPapersAtRetraction_retr...
Number of Citations,{'Junior': {'CollabMAGCumCitationsAtCollaborat...,{'Junior': {'CollabMAGCumCitationsAtCollaborat...,{'Junior': {'DiffMAGCumCitationsAtRetraction_r...
Number of Collaborators,{'Junior': {'CollabMAGCumCollaboratorsAtCollab...,{'Junior': {'CollabMAGCumCollaboratorsAtCollab...,{'Junior': {'DiffMAGCumCollaboratorsAtRetracti...


In [81]:
def save_dict(dicto, fname):
    import pickle 

    with open(fname, 'wb') as f:
        pickle.dump(dicto, f)
        
def read_dict(fname):
    import pickle
    
    with open(fname, 'rb') as f:
        loaded_dict = pickle.load(f)
        return loaded_dict
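
A quick round-trip check of the two helpers, written against a temporary directory so that it does not depend on `OUTDIR`. The helper bodies are repeated to keep the sketch self-contained, and the file name and dictionary contents are made up:

```python
import os
import pickle
import tempfile

def save_dict(dicto, fname):
    with open(fname, "wb") as f:
        pickle.dump(dicto, f)

def read_dict(fname):
    with open(fname, "rb") as f:
        return pickle.load(f)

toy = {"DiD": {"Academic Age": {"Junior": {"delta_mean": -0.84}}}}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy.pkl")
    save_dict(toy, path)
    restored = read_dict(path)

print(restored == toy)
```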

In [82]:
OUTDIR = "/Users/sm9654/desktop/NYUAD/nyuad-research/retraction_openalex/retraction_effects_on_academic_careers/data/plot_data/"

save_dict(master_dict, OUTDIR+"/collaborator_chars_byAge_firstlastauthors.pkl")



In [83]:
dict_temp = read_dict(OUTDIR+"/collaborator_chars_byAge_firstlastauthors.pkl")
dict_temp

{'Retention': {'Academic Age': {'Junior': {'CollabAcademicAgeAtCollaboration_retracted_mean': 12.61,
    'CollabAcademicAgeAtCollaboration_retracted_median': 12.0,
    'CollabAcademicAgeAtCollaboration_retracted_std': 8.06,
    'CollabAcademicAgeAtCollaboration_nonretracted_mean': 14.48,
    'CollabAcademicAgeAtCollaboration_nonretracted_median': 13.17,
    'CollabAcademicAgeAtCollaboration_nonretracted_std': 7.03,
    'CollabAcademicAgeAtCollaboration_delta_mean': -1.87,
    'CollabAcademicAgeAtCollaboration_pval_welch': 0.026,
    'CollabAcademicAgeAtCollaboration_CI_95lower': -3.35,
    'CollabAcademicAgeAtCollaboration_CI_95upper': -0.4},
   'Mid': {'CollabAcademicAgeAtCollaboration_retracted_mean': 14.63,
    'CollabAcademicAgeAtCollaboration_retracted_median': 14.66,
    'CollabAcademicAgeAtCollaboration_retracted_std': 6.31,
    'CollabAcademicAgeAtCollaboration_nonretracted_mean': 15.86,
    'CollabAcademicAgeAtCollaboration_nonretracted_median': 14.67,
    'CollabAcademicAgeAt