# Setup

Install and import the libraries listed in `requirements.txt` for plotting (`plotly`) and data handling (`pandas`). Both `xlrd` and `openpyxl` are installed so that `.xls` and `.xlsx` Excel files can be read, and `nbformat` is installed so that figures can be displayed in Jupyter.

In [1]:
pip install -r requirements.txt

You are using pip version 18.1, however version 21.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [2]:
from pathlib import Path
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

Set the source directory for the data:

In [3]:
DATA_DIR = "data_anonym"  # public
# DATA_DIR = "./data" # not public

The list of journals (using PubMed abbreviations) of the Review Commons consortium:

In [4]:
REV_COM_CONSORTIUM = [
    "Biol Open", "Development", "Dis Model Mech", "Elife", "EMBO J", "EMBO Mol Med", "EMBO Rep", "EMBO Rep", "J Cell Sci",
    "J Cell Sci", "J Cell Sci", "Mol Syst Biol", "Mol Syst Biol", "PLoS Comput Biol", "PLoS Genet", "PLoS One", "PLoS Pathog",
]
assert len(REV_COM_CONSORTIUM) == 17  # make sure we did not forget anybody!
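A length check alone cannot tell whether every journal is listed exactly once, because a repeated entry counts just like a distinct one. As a minimal sketch (the helper `find_duplicates` and the toy list are hypothetical and not part of the notebook), repeated entries can be surfaced with `collections.Counter`:

```python
from collections import Counter


def find_duplicates(names):
    """Return the entries that occur more than once, with their counts."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}


# Hypothetical toy list, not the real consortium
toy_journals = ["EMBO J", "EMBO Rep", "EMBO Rep", "J Cell Sci"]
duplicates = find_duplicates(toy_journals)
print(duplicates)  # {'EMBO Rep': 2}
```

An empty result from such a check, combined with the length assertion, would give more confidence that no journal was left out.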

The date limit for the analysis. Only manuscripts submitted before this date are included.

In [5]:
LIMIT_DATE = '2020-06-30'
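A date-only string such as `'2020-06-30'` stands for midnight at the start of that day once it is parsed as a timestamp, so a `<=` comparison against it leaves out submissions made later on the limit date itself. The plain-Python sketch below illustrates these comparison semantics with hypothetical submission timestamps; it is not the notebook's filtering code.

```python
from datetime import datetime

LIMIT_DATE = "2020-06-30"
limit = datetime.fromisoformat(LIMIT_DATE)  # 2020-06-30 00:00:00

# Hypothetical submission timestamps
submissions = {
    "A": datetime(2020, 6, 29, 3, 31),  # the day before the limit
    "B": datetime(2020, 6, 30, 9, 0),   # on the limit date, after midnight
}

included = [name for name, ts in submissions.items() if ts <= limit]
print(included)  # ['A']
```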

# Import data

The analyses involve data from 2 sources:

- the eJP editorial system: the SQL queries used to extract the relevant data are provided in `ejp_queries/`
- automated MatchPub2 analysis (https://github.com/embo-press/matchpub): used to automatically retrieve published papers that match submissions.

All the data required for the analysis are collected in `data/`.

Import results from ejp query `ejp_queries/ref_report_content.sql` to analyze referee report content. 

In [6]:
ref_reports = pd.read_excel(f'{DATA_DIR}/query_tool_referee_report_content.xls', skiprows=3, header=0)
ref_reports.columns

Index(['journal', 'manuscript_nm', 'num_reports', 'review_len',
       'avg_time_to_secure_rev', 'min_time_to_secure_rev', 'review_duration'],
      dtype='object')

Import results from ejp query `ejp_queries/time_to_ref_preprint.sql` that extracts the time to post the reviews next to the preprint ('refereed preprint').

In [7]:
ref_pre_print = pd.read_excel(f'{DATA_DIR}/query_tool_time_to_ref_preprint_rc.xls', skiprows=3, header=0)
ref_pre_print.columns

Index(['manuscript_nm', 'rev_no', 'sub_date', 'posting_date',
       'time_to_ref_preprint'],
      dtype='object')

Import the MatchPub2 analysis for Review Commons submissions to retrieve articles published within and outside of the consortium of affiliate journals.

**Note**: The table was curated manually to remove one false positive.

**Note**: a column is added to indicate if a journal belongs or not to the Review Commons consortium (column `in_group` with value `'y'|'n'`) and to validate the retrieved articles (column `validated` with value `'y'|'n'`)

In [8]:
rev_com_matchpub = pd.read_excel(f'{DATA_DIR}/revcom-found-2021-02-23-11-05-35.xlsx', engine='openpyxl', header=0)
rev_com_matchpub['sub_date'] = rev_com_matchpub['sub_date'].astype('datetime64[ns]')  # make sure of proper datetime type
rev_com_matchpub['pub_date'] = rev_com_matchpub['pub_date'].astype('datetime64[ns]')  # make sure of proper datetime type
# add a column in_group to label papers published in a journal of the Review Commons consortium
rev_com_matchpub['in_group'] = 'n'
rev_com_matchpub.loc[rev_com_matchpub['journal'].isin(REV_COM_CONSORTIUM), 'in_group'] = 'y'
rev_com_matchpub.columns

Index(['Unnamed: 0', 'manuscript_nm', 'sub_date', 'editor', 'decision',
       'journal', 'pub_date', 'title_score', 'author_score',
       'min_time_to_secure_rev', 'avg_time_to_secure_rev', 'referee_number',
       'in_group'],
      dtype='object')
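The `in_group` label depends on exact string matching between the journal abbreviation returned by MatchPub2 and the entries of the consortium list, so a difference in capitalization or spelling yields `'n'`. A minimal plain-Python sketch of this behavior, using a hypothetical helper `label_in_group` and a toy subset of journals:

```python
def label_in_group(journal, consortium):
    """Return 'y' when the abbreviation is listed exactly as given, else 'n'."""
    return "y" if journal in consortium else "n"


toy_consortium = {"EMBO J", "Elife"}  # hypothetical subset, for illustration only
labels = {j: label_in_group(j, toy_consortium) for j in ["EMBO J", "eLife", "Nature"]}
print(labels)  # {'EMBO J': 'y', 'eLife': 'n', 'Nature': 'n'}
```

Here `"eLife"` is labeled `'n'` only because its capitalization differs from the listed `"Elife"`.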

Import MatchPub analyses to retrieve EMBO Press 2018 submissions published in EMBO Press journals and elsewhere. These will be used to estimate a lower bound for the time to publish in a journal.

In [9]:
emboj = pd.read_excel(f'{DATA_DIR}/emboj_2018-found-2021-02-23-11-46-42.xlsx', engine='openpyxl')
embor = pd.read_excel(f'{DATA_DIR}/embor_2018-found-2021-02-23-12-03-02.xlsx', engine='openpyxl')
embomolmed = pd.read_excel(f'{DATA_DIR}/embomolmed_2018-found-2021-02-23-12-41-59.xlsx', engine='openpyxl')
msb = pd.read_excel(f'{DATA_DIR}/msb_2018-found-2021-02-23-12-12-45.xlsx', engine='openpyxl')
# fix datetime data type
def datetime_dtype(df):
    df['sub_date'] = df['sub_date'].astype('datetime64[ns]')
    df['pub_date'] = df['pub_date'].astype('datetime64[ns]')
for j in [emboj, embor, embomolmed, msb]:
    datetime_dtype(j)
    print(f"{len(j)} rows")

1235 rows
1001 rows
717 rows
317 rows


Import data from the eJP query `aggreg_outcome.sql` on transfer outcomes to affiliate journals and on journal decisions.

**Note**: This dataframe is restricted to proper outcomes, i.e. manuscripts that have undergone initial editorial selection or peer review, thereby excluding submissions pending initial editorial selection or currently under review.


In [10]:
outcomes = pd.read_excel(f'{DATA_DIR}/query_tool_transfer_outcome_by_manu_rc.xls', engine='xlrd', skiprows=3, header=0)
outcomes['sub_date'] = outcomes['sub_date'].astype('datetime64[ns]') # fix datetime data type
# restrict proper outcomes to manuscripts that have undergone editorial selection or peer review
outcomes = outcomes[
    (outcomes['rev_com_decision'] == "rejected before review") |  
    (outcomes['rev_com_decision'] == "suggest posting of reviews")
]
outcomes = outcomes[outcomes['sub_date'] <= LIMIT_DATE]
outcomes.columns

Index(['manuscript_nm', 'sub_date', 'preprint', 'rev_com_decision', 'journal',
       'transfer_response', 'num_transfers', 'journal_decision'],
      dtype='object')

Import data from ejp query `rev_com_accepted.sql` on manuscripts accepted for publication in one of the consortium's journal.

In [11]:
accepted = pd.read_excel(f'{DATA_DIR}/query_tool_accepted_rc_manu.xls', engine='xlrd', skiprows=3, header=0)
accepted.columns

Index(['manuscript_nm', 'sub_date', 'journal'], dtype='object')

# Analysis

## Referee report content

Plot the distribution of the length of referee reports. Columns from `ref_reports` used:
- `journal`: journal title
- `review_len`: length of the reviews in bytes

In [12]:
ref_reports.columns

Index(['journal', 'manuscript_nm', 'num_reports', 'review_len',
       'avg_time_to_secure_rev', 'min_time_to_secure_rev', 'review_duration'],
      dtype='object')

In [13]:
fig = px.box(
    ref_reports,
    y="review_len",
    x="journal",
    category_orders={"journal": ['emboj', 'embor', 'msb', 'embomolmed', 'reviewcommons']},
    color_discrete_sequence=px.colors.qualitative.G10,
    points="all",
    title="Distribution of the length of individual referee reports",
    labels={"review_len": "referee report length [bytes]"},
    color="journal"
)
fig.update_traces(
    marker={"opacity": 0.2}
)
fig.show()
fig.write_image("./img/ref_report_len.png")

<img src="img/ref_report_len.png">

## Time to public peer-reviewed research

Comparison of time to publish a paper in a journal, with the time to post a Review Commons refereed preprint and time to publish a journal paper through the Review Commons pipeline. 

Use the `emboj`, `embor`, `msb` and `embomolmed` dataframes to compute the time to publish by subtracting the submission date `sub_date` from the publication date `pub_date`.

In [41]:
# simple function to compute time to publish for a journal-specific dataframe
def time_diff(df: pd.DataFrame) -> pd.DataFrame:
    df['time_to_publish'] = df['pub_date'] - df['sub_date']
    df['time_to_publish'] = df['time_to_publish'].dt.days  # convert time interval in unit of days
    df.loc[df['time_to_publish'] < 0, 'time_to_publish'] = None  # discard potential MatchPub false positives published before the submission date
    return df
# process our journals (the loop variable is named df to avoid shadowing the pandas alias pd)
df_list = [time_diff(df) for df in [emboj, embor, msb, embomolmed]]
# concatenate the individual tables into a single one
time_to_paper = pd.concat(df_list)
time_to_paper['type'] = 'Classical Journals'   # useful for later when plotting as a function of 'type' of publishing channel
# trick to color accepted vs rejected journal papers
time_to_paper['color_index'] = "journals-" + time_to_paper['decision']
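The time-to-publish rule can be illustrated on plain dates. The sketch below is a standalone restatement with a hypothetical helper `days_to_publish` and made-up dates: the interval is expressed in whole days, and a publication date earlier than the submission date is treated as a likely false match and returned as `None`.

```python
from datetime import datetime


def days_to_publish(sub_date, pub_date):
    """Whole days from submission to publication, or None if publication precedes submission."""
    delta_days = (pub_date - sub_date).days
    return delta_days if delta_days >= 0 else None


# Hypothetical dates, for illustration only
print(days_to_publish(datetime(2018, 1, 10), datetime(2018, 7, 9)))    # 180
print(days_to_publish(datetime(2018, 1, 10), datetime(2017, 12, 1)))   # None
```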

The `ref_pre_print` dataframe already contains the time to post the reviews as extracted from the editorial system.

In [42]:
ref_pre_print['type'] = 'Rev Com - Refereed Preprints'  # useful for later when plotting as a function of 'type' of publishing channel
ref_pre_print.rename(columns={'time_to_ref_preprint': 'time_to_publish'}, inplace=True)  # rename column to be consistent with the other dataframes
ref_pre_print['color_index'] = 'refereed preprints'  # single color
ref_pre_print.columns

Index(['manuscript_nm', 'rev_no', 'sub_date', 'posting_date',
       'time_to_publish', 'type', 'color_index'],
      dtype='object')

Use the MatchPub2 results for Review Commons to calculate the time to publish a paper in a journal of the consortium.
Columns used:
- `sub_date`: submission date to Review Commons
- `pub_date`: publication date of the matching published article as found by MatchPub2
- `in_group`: whether the journal is within the consortium (`y`) or outside of the consortium (`n`)

In this analysis we restrict to journals that are part of the Review Commons consortium. 

In [48]:
revcom_j = rev_com_matchpub[rev_com_matchpub['in_group'] == 'y'].copy()  # restrict to journals that are members of the Rev Com consortium; copy so that new columns are not added to a view
revcom_j['time_to_publish'] = revcom_j['pub_date'] - revcom_j['sub_date']
revcom_j['time_to_publish'] = revcom_j['time_to_publish'].dt.days
revcom_j['type'] = 'Rev Com - Affiliate Journals'   # useful for later when plotting as a function of 'type' of publishing channel
revcom_j['color_index'] = 'rev com journal'  # single color


Concatenate:
- `time_to_paper`, which gives the time to publish for journal articles,
- `ref_pre_print`, with the time to post a refereed preprint, and
- `revcom_j`, which gives the time to publish a journal article through the Review Commons transfer system.

In [49]:
t2p = pd.concat([
    time_to_paper[['type', 'time_to_publish', 'color_index']],
    ref_pre_print[['type', 'time_to_publish', 'color_index']],
    revcom_j[['type', 'time_to_publish', 'color_index']]
])
t2p.columns
t2p['type'].unique()

array(['Classical Journals', 'Rev Com - Refereed Preprints',
       'Rev Com - Affiliate Journals'], dtype=object)

Plot the distributions of the time to public peer-reviewed research for three channels: journals (including papers accepted in EMBO Press journals and rejected papers published elsewhere), Review Commons refereed preprints, and papers peer reviewed by Review Commons and published in a journal of the consortium.

In [50]:
fig = px.box(
    t2p,
    y="time_to_publish",
    x="type",
    points="all",
    color="type",
    color_discrete_map={
        # keys must match the values of the 'type' column exactly
        "Classical Journals": "darkslateblue",
        # "journals-accepted": "darkgreen",
        "Rev Com - Refereed Preprints": "crimson",
        "Rev Com - Affiliate Journals": "dodgerblue"
    },
    labels = {
        "time_to_publish": 'time from submission to public peer reviewed version [days]',
        "type": "",
    
    },
    title="Accelerating the dissemination of peer reviewed research",
    height=800, width=700
)
fig.update_traces(
    marker={
        "opacity": 0.2
    }
)
fig.show()
fig.write_image("./img/time_to_reviewed_res.png")

<img src="img/time_to_reviewed_res.png">
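In Plotly Express, a `color_discrete_map` entry only takes effect when its key equals a value of the `color` column; values without a matching key are drawn with the default color sequence instead. A quick check before plotting can catch such mismatches. The sketch below uses a hypothetical helper `unmapped_categories` and a toy map with a deliberately misspelled key:

```python
def unmapped_categories(categories, color_map):
    """Return the categories that have no exact key in the color map."""
    return sorted(set(categories) - set(color_map))


categories = [
    "Classical Journals",
    "Rev Com - Refereed Preprints",
    "Rev Com - Affiliate Journals",
]
# Hypothetical map: the first key lacks the final 's' on purpose
toy_color_map = {
    "Classical Journal": "darkslateblue",
    "Rev Com - Refereed Preprints": "crimson",
    "Rev Com - Affiliate Journals": "dodgerblue",
}

missing = unmapped_categories(categories, toy_color_map)
print(missing)  # ['Classical Journals']
```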

## Current manuscript status

Visualization of the current status of manuscripts submitted to Review Commons, to analyze the fraction of papers accepted in a journal of the consortium, editorially rejected by Review Commons, or still in 'transit' through the transfer pipeline. Excluded from this analysis are 'pending' papers, i.e. those still under editorial consideration by Review Commons or currently under review.

Each submission is given its *current* status:
- `rejected before review`: manuscript rejected by Review Commons before review.
- `0 transfer`: manuscripts that have not yet been transferred to any Review Commons journal.
- `1 transfer`, `2 transfers`, `3 transfers`, `4 transfers`: current number of transfers attempted for pending manuscripts that were NOT yet accepted.
- `accept`: manuscripts that were accepted for publication in any of the Review Commons consortium journals.

In [19]:
# clean up rare repeated decisions
outcomes.loc[outcomes['journal_decision'].str.contains('accept', na=False), 'journal_decision'] = 'accept'
# accepted manuscripts
outcomes.loc[outcomes['journal_decision'] == 'accept', 'status'] = 'accept'
# just in case there are manuscripts without a Review Commons decision or under review; should be zero
outcomes.loc[outcomes['rev_com_decision'].isnull(), 'status'] = 'pending'
# manuscripts rejected before review by Review Commons
outcomes.loc[outcomes['rev_com_decision'] == 'rejected before review', 'status'] = 'rejected before review'
# zero transfers would include both editorial rejections and post-review manuscripts; restricting to post-review here
outcomes.loc[(outcomes['rev_com_decision'] == 'suggest posting of reviews') & (outcomes['num_transfers'] == 0), 'status'] = '0 transfer'
# current number of transfer attempts for manuscript that are still pending in the journal pipeline
outcomes.loc[(outcomes['journal_decision'] != 'accept') & (outcomes['num_transfers'] == 1), 'status'] = '1 transfer'
outcomes.loc[(outcomes['journal_decision'] != 'accept') & (outcomes['num_transfers'] == 2), 'status'] = '2 transfers'
outcomes.loc[(outcomes['journal_decision'] != 'accept') & (outcomes['num_transfers'] == 3), 'status'] = '3 transfers'
outcomes.loc[(outcomes['journal_decision'] != 'accept') & (outcomes['num_transfers'] == 4), 'status'] = '4 transfers'
viz = outcomes.groupby('status').count()  # makes status index
# order the rows for visualization
viz = viz.loc[['accept', 'rejected before review', '0 transfer', '1 transfer', '2 transfers', '3 transfers', '4 transfers'], 'manuscript_nm']
# re-inserts 'status' as column so that can be used a facet for viz
viz = viz.reset_index()
viz


| | status | manuscript_nm |
|---|---|---|
| 0 | accept | 83 |
| 1 | rejected before review | 53 |
| 2 | 0 transfer | 8 |
| 3 | 1 transfer | 29 |
| 4 | 2 transfers | 20 |
| 5 | 3 transfers | 15 |
| 6 | 4 transfers | 8 |
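The status rules can also be read as a decision sequence for a single manuscript. The function below is a simplified, hypothetical restatement (the name `current_status` and its signature are not part of the notebook), intended only to make the precedence of the rules explicit: acceptance first, then editorial rejection, then the number of transfers.

```python
def current_status(rev_com_decision, num_transfers, journal_decision=None):
    """Simplified restatement of the status rules for one manuscript."""
    if journal_decision is not None and "accept" in journal_decision:
        return "accept"
    if rev_com_decision is None:
        return "pending"  # no Review Commons decision yet
    if rev_com_decision == "rejected before review":
        return "rejected before review"
    if num_transfers == 0:
        return "0 transfer"  # reviewed but not yet transferred
    return "1 transfer" if num_transfers == 1 else f"{num_transfers} transfers"


# Hypothetical example rows: (rev_com_decision, num_transfers, journal_decision)
examples = [
    ("suggest posting of reviews", 1, "accept"),
    ("rejected before review", 0, None),
    ("suggest posting of reviews", 0, None),
    ("suggest posting of reviews", 3, None),
]
for args in examples:
    print(args, "->", current_status(*args))
```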


In [20]:
fig = go.Figure(
    data=[go.Pie(
        labels=list(viz['status']),
        values=list(viz['manuscript_nm']),
        pull=[0.1, 0, 0, 0, 0, 0, 0],
        sort=False,
        direction='clockwise',
    )],
    layout=go.Layout(
        height=600, width=600,
        title_text="Current manuscript status",
        title_font_size=30
    )
)
colors = {
    "rejected before review": "darkred",
    "0 transfer": "lightgray",
    "1 transfer": "gainsboro",
    "2 transfers": "darkgrey",
    "3 transfers": "grey",
    "4 transfers": "dimgrey",
    "accept": "forestgreen"
}
fig.update_traces(

    textposition='outside',
    textinfo='label+value+percent',
    textfont_size=12,
    marker=dict(
        # line=dict(
        #     width=0.5,
        #     color='White'
        # ),
        colors=[colors[k] for k in list(viz['status'])]
    )
)
fig.show()
fig.write_image("./img/transfer_status.png")

<img src="img/transfer_status.png">

## Accepted manuscripts

Distribution of the journals having accepted Review Commons manuscripts.

In [21]:
accepted_by_journal = accepted.groupby('journal').count()
accepted_by_journal = accepted_by_journal['manuscript_nm']
accepted_by_journal = accepted_by_journal.loc[[
    'plosbio', 'plosgen', 'plospath', 'ploscomp', 'plosone',
    'emboj', 'er', 'msb', 'emm', 'lsa',
    'elife',
    'jcs', 'dev', 'dmm', 'biolopen',
    'jcb',
    'mboc'
]]
accepted_by_journal

journal
plosbio      6
plosgen      9
plospath     8
ploscomp     4
plosone      2
emboj       11
er          16
msb          4
emm          2
lsa          5
elife       23
jcs         12
dev          6
dmm          3
biolopen     2
jcb          4
mboc         4
Name: manuscript_nm, dtype: int64

In [22]:
fig = go.Figure(
    data=[go.Pie(
        labels=list(accepted_by_journal.index),
        values=list(accepted_by_journal),
        sort=False,
        direction='clockwise',   
    )]
)
colors = {
    'plosbio': 'maroon', 'plosgen': 'darkred', 'plospath': 'brown', 'ploscomp': 'firebrick', 'plosone':  'indianred',
    'emboj': "darkgreen", 'er': 'forestgreen', 'msb': "limegreen", 'emm': 'lightgreen', 'lsa': "palegreen",
    'elife': 'gold',
    'jcs': 'midnightblue', 'dev': 'mediumblue', 'dmm': 'dodgerblue', 'biolopen': 'skyblue',
    'jcb': 'peru',
    'mboc': 'rebeccapurple'
}
fig.update_traces(
    textposition='outside',
    textinfo='label+percent',
    textfont_size=10,
    marker={
        "line":{
            "width": 0.5,
            "color": 'White'
        },
        "colors": [colors[k] for k in list(accepted_by_journal.index)]
    }
)
fig.show()
fig.write_image("./img/accepted_manu.png")

<img src="img/accepted_manu.png">

## Manuscript flow

To visualize the flow of manuscripts through the Review Commons system, the following quantities are computed:
- `preprint_available`: submissions for which a preprint is available
- `no_preprint`: submissions for which there is no preprint
- `revcom2rejected`: submissions rejected by Rev Com (these are also, by definition, never transferred)
- `revcom2one`: submissions transferred to a first journal (all submissions minus submissions with zero transfers, including editorial rejections)
- `rejected2inside`: rejected by Rev Com and published inside the consortium
- `rejected2outside`: rejected by Rev Com and published outside of the consortium
- `one2two`: submissions transferred to a second journal
- `one2accept_inside`: submissions accepted after the first transfer in a consortium journal
- `one2accept_outside`: submissions published after the first transfer in a journal outside the consortium
- `two2three`, `two2accept_inside`, `two2accept_outside`: same, with flux after two transfers to a third transfer or to acceptance
- `three2four`, `three2accept_inside`, `three2accept_outside`: same, with flux after three transfers to a fourth transfer or to acceptance
- `four2accept_inside`, `four2accept_outside`: flux from the terminal fourth transfer to acceptance.


Count manuscripts for which there is a preprint and those for which there is no preprint.

In [23]:
preprint_availability = outcomes.groupby('preprint').count()
preprint_availability = preprint_availability['manuscript_nm']
preprint_available = preprint_availability['Yes']
no_preprint = preprint_availability['No']
preprint_available, no_preprint

(105, 111)

Review Commons editorial rejections.

In [24]:
ed_selection = outcomes.groupby('rev_com_decision').count()
ed_selection = ed_selection['manuscript_nm']
ed_rejected = ed_selection['rejected before review']
ed_rejected

53

From the MatchPub2 results, retrieve the number of editorially rejected manuscripts eventually published inside and outside the group of Rev Com journals.

In [25]:
published_after_ed_rej = rev_com_matchpub.query("(decision == 'rejected before review')")
published_after_ed_rej = published_after_ed_rej.groupby('in_group').count()
published_after_ed_rej = published_after_ed_rej['manuscript_nm']
published_after_ed_rej

in_group
n    21
y     6
Name: manuscript_nm, dtype: int64

Determine the number of transfers for manuscripts published outside of the Review Commons consortium.

In [26]:
published_outside = rev_com_matchpub.loc[
    rev_com_matchpub["in_group"] == 'n',
    ['manuscript_nm', 'journal', 'in_group']
]
published_outside.rename(columns={'journal': 'external_jou'}, inplace=True)
published_outside.columns

Index(['manuscript_nm', 'external_jou', 'in_group'], dtype='object')

In [27]:
merged = pd.merge(
    left=outcomes,
    right=published_outside,
    on='manuscript_nm',
    how='left'
)
merged

| | manuscript_nm | sub_date | preprint | rev_com_decision | journal | transfer_response | num_transfers | journal_decision | status | external_jou | in_group |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RC-2019-00104 | 2019-12-10 08:15:27.033 | No | rejected before review | | | 0 | | rejected before review | Biomolecules | n |
| 1 | RC-2019-00106 | 2019-12-10 08:17:21.780 | No | suggest posting of reviews | er, elife, jcb | reject, reject, reject | 3 | | 3 transfers | | |
| 2 | RC-2019-00107 | 2019-12-10 08:18:27.963 | Yes | rejected before review | | | 0 | | rejected before review | | |
| 3 | RC-2019-00109 | 2019-12-11 07:52:12.573 | Yes | rejected before review | | | 0 | | rejected before review | | |
| 4 | RC-2019-00110 | 2019-12-11 07:53:12.143 | No | suggest posting of reviews | ploscomp | reject | 1 | | 1 transfer | BMC Biol | n |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 211 | RC-2020-00342 | 2020-06-25 03:11:16.983 | No | suggest posting of reviews | emm | consider | 1 | accept | accept | | |
| 212 | RC-2020-00343 | 2020-06-26 04:23:04.497 | No | suggest posting of reviews | emm | consider | 1 | | 1 transfer | | |
| 213 | RC-2020-00344 | 2020-06-26 04:27:44.887 | Yes | suggest posting of reviews | plospath, jcb, emboj | consider, reject, reject | 3 | accept | accept | | |
| 214 | RC-2020-00345 | 2020-06-29 03:31:42.893 | Yes | suggest posting of reviews | mboc | consider | 1 | | 1 transfer | Mol Biol Cell | n |


In [28]:
external_jou = merged[merged['external_jou'].notnull()]
external_jou = external_jou.groupby('num_transfers').count()
external_jou = external_jou['manuscript_nm']
external_jou

num_transfers
0    17
1    12
2     3
3     6
Name: manuscript_nm, dtype: int64

The current number of transfers for all manuscripts (remember that zero transfers include editorial rejections from Rev Com.)

In [29]:
num_transfers = outcomes.groupby('num_transfers').count()
num_transfers = num_transfers['manuscript_nm']
num_transfers

num_transfers
0    61
1    59
2    52
3    32
4    12
Name: manuscript_nm, dtype: int64

Sanity check on total number of Rev Com submissions.

In [30]:
assert num_transfers.sum() == preprint_availability.sum()  # sanity check
total = num_transfers.sum()
total

216

Current number of transfers (this is NOT equal to the flow!)

In [31]:
zero_transfers = num_transfers[0]
zero_transfers

61

For each current transfer status, count the manuscripts that were accepted in a journal of the consortium (rejections are irrelevant here).

In [32]:
one_transfer = outcomes.loc[outcomes['num_transfers'] == 1]
one_transfer = one_transfer.groupby('journal_decision').count()
one_transfer = one_transfer['manuscript_nm']
one_transfer

journal_decision
accept    30
reject     2
Name: manuscript_nm, dtype: int64

In [33]:
two_transfers = outcomes.loc[outcomes['num_transfers'] == 2]
two_transfers = two_transfers.groupby('journal_decision').count()
two_transfers = two_transfers['manuscript_nm']
two_transfers


journal_decision
accept    32
reject     1
Name: manuscript_nm, dtype: int64

In [34]:
three_transfers = outcomes.loc[outcomes['num_transfers'] == 3]
three_transfers = three_transfers.groupby('journal_decision').count()
three_transfers = three_transfers['manuscript_nm']
three_transfers

journal_decision
accept    17
Name: manuscript_nm, dtype: int64

In [35]:
four_transfers = outcomes.loc[outcomes['num_transfers'] == 4]
four_transfers = four_transfers.groupby('journal_decision').count()
four_transfers = four_transfers['manuscript_nm']
four_transfers

journal_decision
accept    4
Name: manuscript_nm, dtype: int64

In [36]:
# manuscripts transferred to a first journal are all those that do not currently have zero transfers
# note that zero transfers includes pending manuscripts that were never transferred
# as well as editorially rejected manuscripts, which are never transferred by definition
revcom2one = total - zero_transfers
revcom2rejected = ed_rejected

rejected2inside = published_after_ed_rej['y']
rejected2outside = published_after_ed_rej['n']

one2two = revcom2one - num_transfers[1]
one2accept_inside = one_transfer['accept']
one2accept_outside = external_jou[1]

two2three = one2two - num_transfers[2]
two2accept_inside = two_transfers['accept']
two2accept_outside = external_jou[2]

three2four = two2three - num_transfers[3]
three2accept_inside = three_transfers['accept']
three2accept_outside = external_jou[3]

four2accept_inside = four_transfers['accept']
four2accept_outside = 0  # external_jou has no entry for 4 transfers, so external_jou[4] cannot be used
print("total", total)
print("revcom2one, revcom2rejected", revcom2one, revcom2rejected)
print("rejected2inside, rejected2outside", rejected2inside, rejected2outside)
print("one2two, one2accept_inside, one2accept_outside", one2two, one2accept_inside, one2accept_outside)
print("two2three, two2accept_inside, two2accept_outside", two2three, two2accept_inside, two2accept_outside) 
print("three2four, three2accept_inside, three2accept_outside", three2four, three2accept_inside, three2accept_outside)
print("four2accept_inside, four2accept_outside", four2accept_inside, four2accept_outside)

total 216
revcom2one, revcom2rejected 155 53
rejected2inside, rejected2outside 6 21
one2two, one2accept_inside, one2accept_outside 96 30 12
two2three, two2accept_inside, two2accept_outside 44 32 3
three2four, three2accept_inside, three2accept_outside 12 17 6
four2accept_inside, four2accept_outside 4 0
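The flows between successive journals follow from the counts of manuscripts by current number of transfers: a manuscript that currently has k transfers has passed through every journal up to the k-th one, so the flow into the k-th journal is the number of manuscripts with at least k transfers. The sketch below reproduces this arithmetic in plain Python from the counts printed earlier; the helper name `flows_between_journals` is hypothetical.

```python
# Manuscripts by current number of transfers, as printed earlier in the notebook
num_transfers = {0: 61, 1: 59, 2: 52, 3: 32, 4: 12}


def flows_between_journals(counts):
    """Return, for each k >= 1, how many manuscripts reached the k-th journal."""
    remaining = sum(counts.values())  # manuscripts with at least 0 transfers
    flows = {}
    for k in sorted(counts):
        if k > 0:
            flows[k] = remaining      # manuscripts with at least k transfers
        remaining -= counts[k]        # drop those that currently stand at k transfers
    return flows


flows_by_journal = flows_between_journals(num_transfers)
print(flows_by_journal)  # {1: 155, 2: 96, 3: 44, 4: 12}
```

These values correspond to `revcom2one`, `one2two`, `two2three` and `three2four` in the printed output.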


Average number of transfers for accepted papers

In [37]:
avg_transfer = outcomes.loc[outcomes['journal_decision'] == 'accept', 'num_transfers'].mean()
avg_transfer

1.9397590361445782
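As a cross-check, the same average can be recomputed by hand from the per-transfer acceptance counts printed above (30, 32, 17 and 4 accepted manuscripts after 1, 2, 3 and 4 transfers). This is a plain-Python sketch of the weighted mean, not the notebook's pandas call:

```python
# Accepted manuscripts by number of transfers, taken from the outputs above
accepted_by_transfers = {1: 30, 2: 32, 3: 17, 4: 4}

total_accepted = sum(accepted_by_transfers.values())
total_transfers = sum(k * n for k, n in accepted_by_transfers.items())
avg_transfer_check = total_transfers / total_accepted

print(total_accepted, total_transfers, round(avg_transfer_check, 2))  # 83 161 1.94
```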

In [38]:
def prct(x):
    percent_total = f"{int(100*x/int(total))}%"
    return percent_total
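Note that `int()` truncates toward zero, so the percentage labels are rounded down rather than to the nearest integer. The sketch below contrasts the two behaviors with hypothetical helper names (`prct_truncated`, `prct_rounded`) and the total of 216 submissions passed explicitly:

```python
def prct_truncated(x, total):
    """Percentage of total, truncated to an integer (same arithmetic as prct)."""
    return f"{int(100 * x / total)}%"


def prct_rounded(x, total):
    """Percentage of total, rounded to the nearest integer."""
    return f"{round(100 * x / total)}%"


# 155 of 216 is about 71.76 %
print(prct_truncated(155, 216), prct_rounded(155, 216))  # 71% 72%
```

Whether truncation or rounding is preferable for the figure labels is a presentation choice; with truncation the displayed percentages of sibling flows may sum to slightly less than the parent flow.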

In [39]:
nodes = {
    "label": ["preprint", "no preprint", "rev com", "rev com reject", "1st j", "2nd j", "3rd j", "4th j", "accept inside", "accept outside"],
    "x":       [0.1,        0.1,          0.25,       0.4,              0.4,     0.5,     0.6,      0.7,     0.9,               0.9             ],
    "y":       [0.1,        0.8,          0.50,       0.8,              0.30,     0.35,     0.38,      0.4,     0.1,             0.8             ],
    "colors": ["maroon", "darkgoldenrod", "midnightblue", "black", "midnightblue", "midnightblue", "midnightblue", "midnightblue", "green", "forestgreen"]
}
flows = [
    ("preprint", "rev com", preprint_available, (0.16, 0.78)),
    ("no preprint", "rev com", no_preprint, (0.16, 0.28)),
    ("rev com", "rev com reject", revcom2rejected, (0.3, 0.15)),
    ("rev com", "1st j", revcom2one, (0.3, 0.7)),
    ("rev com reject", "accept inside", rejected2inside, (0.55, 0.35)),
    ("rev com reject", "accept outside", rejected2outside, (0.6, 0.18)),
    ("1st j", "2nd j", one2two, (0.45, 0.65)), 
    ("1st j", "accept inside", one2accept_inside, (0.7, 1.05)),
    ("1st j", "accept outside", one2accept_outside, (0.7, 0.3)),
    ("2nd j", "3rd j", two2three, (0.55, 0.58)),
    ("2nd j", "accept inside", two2accept_inside, (0.75, 0.88)),
    ("2nd j", "accept outside", two2accept_outside, (0.7, 0.4)),
    ("3rd j", "4th j", three2four, (0.67, 0.58)),
    ("3rd j", "accept inside", three2accept_inside, (0.8, 0.8)),
    ("3rd j", "accept outside", three2accept_outside, (0.75, 0.45)),
    ("4th j", "accept inside", four2accept_inside, (0.82, 0.7)),
    ("4th j", "accept outside", four2accept_outside, (10, 10))
]
sources = [nodes['label'].index(f[0]) for f in flows]
targets = [nodes['label'].index(f[1]) for f in flows]
values = [f[2] for f in flows]
percentages = [prct(f[2]) for f in flows]
annot_coord = {
    "x": [f[3][0] for f in flows],
    "y": [f[3][1] for f in flows]
}
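A Sankey link refers to its source and target nodes by their integer position in the list of node labels, which is why the labels in each flow are converted to indices. A minimal sketch of that conversion with a toy subset of labels and flows (three-element tuples here, without the annotation coordinates), using a dictionary lookup instead of repeated `list.index` calls:

```python
# Toy subset of node labels and flows, for illustration only
labels = ["rev com", "1st j", "accept inside"]
toy_flows = [
    ("rev com", "1st j", 155),
    ("1st j", "accept inside", 30),
]

index_of = {label: i for i, label in enumerate(labels)}  # label -> node index

toy_sources = [index_of[src] for src, _, _ in toy_flows]
toy_targets = [index_of[dst] for _, dst, _ in toy_flows]
toy_values = [value for _, _, value in toy_flows]

print(toy_sources, toy_targets, toy_values)  # [0, 1] [1, 2] [155, 30]
```

A misspelled label raises a `KeyError` here, much as `list.index` raises a `ValueError`, so typos in the flow definitions are caught before plotting.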

In [40]:
fig = go.Figure(data=[
    go.Sankey(
        # arrangement="snap",   
        node = {
            "pad": 20,
            "thickness": 20,
            "line":{"color": "white", "width": 0.5},
            "label": nodes['label'],
            "x":     nodes['x'],
            "y":     nodes['y'],
            "color": nodes['colors'],
        },
        link = {
            "source": sources,
            "target": targets,
            "value":  values,
            "color": "rgba(100, 100, 100, 0.1)",
            "label": percentages,
        },
    )
])
for i in range(len(flows)):
    fig.add_annotation(
        xref="paper", yref="paper",
        x=annot_coord['x'][i], y=annot_coord['y'][i],
        text=percentages[i],
        showarrow=False
    )
fig.add_annotation(
    x=1.1, y=0.95, xref="paper", yref="paper",
    text=f"Average number<br>of transfers:<br>{avg_transfer:.2f}",
    showarrow=False,
    font={"size": 12}
)
fig.update_layout(title_text="Transfer flux and outcomes", font_size=10)
fig.show()
fig.write_image("./img/manu_flow.png")

<img src="img/manu_flow.png">