In [17]:
from pathlib import Path
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from typing import Dict, List, Tuple
import torch

In [1]:
# Render Plotly figures inline in JupyterLab.
# Static image export reference: https://plotly.com/python/static-image-export/
import plotly.io as pio
pio.renderers.default='jupyterlab'

In [2]:
import sys
import os

# Put the project root on sys.path so the `src.*` imports below resolve.
app = "/app"
if app not in sys.path:
    sys.path.append(app)

In [3]:
from src.comparator import Comparator
from src.embed import OpenAIEmbedder, SBERTEmbedder, BarlowEmbedder
from src.preprint import Preprint
from src.review_process import ReviewProcess
from src.utils import split_paragraphs, split_sentences
from src.config import config

In [4]:
config

Config(embedding_model={'openai': 'text-embedding-ada-002', 'sbert': 'all-mpnet-base-v2', 'barlow': '/app/pretrained/twin-lm-checkpoint-32000'}, sections='introduction+results+discussion+methods')

In [5]:
doi = "10.1101/2021.05.12.443743"

In [6]:
preprint = Preprint(doi=doi)

In [7]:
review_process = ReviewProcess(doi=doi)

In [8]:
preprint_chunks = preprint.get_chunks(split_paragraphs)
# preprint_chunks = preprint.get_chunks(split_sentences)

In [9]:
# One chunk list per review, split the same way as the preprint.
review_1_chunks = review_process.reviews[0].get_chunks(split_paragraphs)
# review_1_chunks = review_process.reviews[0].get_chunks(split_sentences)
review_2_chunks = review_process.reviews[1].get_chunks(split_paragraphs)
review_3_chunks = review_process.reviews[2].get_chunks(split_paragraphs)

In [10]:
print(f"preprint chunks: {len(preprint_chunks)}; review chunks: {len(review_1_chunks)}")

preprint chunks: 52; review chunks: 16


In [11]:
# embedder = SBERTEmbedder()
embedder = OpenAIEmbedder()
# embedder = BarlowEmbedder(model="/app/pretrained/twin-lm-checkpoint-32000", mode='paragraph')

print(f"chosen={embedder.model}")

chosen=text-embedding-ada-002


In [12]:
comp = Comparator(embedder)

In [20]:
# Rows index the review 1 chunks, columns index the preprint chunks.
similarity_matrix = comp.compare_dot(review_1_chunks, preprint_chunks)
similarity_matrix.size()

torch.Size([16, 52])
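
The shape above has one row per review chunk and one column per preprint chunk. As a rough illustration of how such a dot-product similarity matrix can be built, here is a minimal sketch in plain Python. It is not the actual `Comparator.compare_dot` implementation, which is not shown in this notebook; `normalize`, `dot_similarity`, and the toy vectors are hypothetical names and data, and the sketch assumes unit-length embeddings so that the dot product behaves like cosine similarity.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (hypothetical helper, not part of src)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot_similarity(rows: List[Vector], cols: List[Vector]) -> List[List[float]]:
    """Return a len(rows) x len(cols) matrix of pairwise dot products."""
    return [[sum(a * b for a, b in zip(r, c)) for c in cols] for r in rows]


# Toy stand-ins for embedded chunks: 2 "review" vectors, 3 "preprint" vectors.
review_vecs = [normalize(v) for v in [[1.0, 0.0], [1.0, 1.0]]]
preprint_vecs = [normalize(v) for v in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]

toy_matrix = dot_similarity(review_vecs, preprint_vecs)
print(len(toy_matrix), len(toy_matrix[0]))
```

With real embeddings the same layout gives one row per review chunk and one column per preprint chunk, which matches the 16 by 52 shape reported above.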

In [23]:
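cutoff = 0.88
# Clamp to [cutoff, 1] so every score below the cutoff shares the lowest color.
fig = px.imshow(
    torch.clamp(similarity_matrix, min=cutoff, max=1),
    labels=dict(x="preprint chunk", y="review 1 chunk", color="similarity"),
    title=f"{embedder.model}, at cutoff={cutoff}",
    template='seaborn'
)
fig.update_layout(
    height=500,
    width=1000,
)
fig.show()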
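
The heatmap only shows which cells clear the cutoff. To read the matching pairs off as a list, a small helper can help. The following is a minimal sketch in plain Python on made-up scores; `clamp_matrix`, `best_matches`, and `toy_scores` are hypothetical names, and taking the single best preprint chunk per review chunk is an assumption rather than something the notebook does.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def clamp_matrix(matrix: Matrix, low: float, high: float) -> Matrix:
    """Limit every value to [low, high], mirroring what torch.clamp does."""
    return [[min(max(v, low), high) for v in row] for row in matrix]


def best_matches(matrix: Matrix, cutoff: float) -> List[Tuple[int, int, float]]:
    """For each row, keep the highest-scoring column if it reaches the cutoff."""
    matches = []
    for i, row in enumerate(matrix):
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] >= cutoff:
            matches.append((i, j, row[j]))
    return matches


# Made-up similarity scores: 3 review chunks (rows) x 3 preprint chunks (columns).
toy_scores = [
    [0.91, 0.80, 0.85],
    [0.70, 0.86, 0.87],
    [0.89, 0.95, 0.60],
]

clamped = clamp_matrix(toy_scores, 0.88, 1.0)
pairs = best_matches(toy_scores, 0.88)
for review_idx, preprint_idx, score in pairs:
    print(review_idx, preprint_idx, score)
```

Each returned tuple is (review chunk index, preprint chunk index, score), so the indices can be looked up in the chunk listings printed in the following cells.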

In [15]:
for i, p in enumerate(review_1_chunks[:200]):
    print(i, p)

0 In this manuscript, Mishima et al., designed a reporter system (dubbed PACE, for Parallel Analysis of Codon Effects) to assess the effect of codon usage in regulating mRNA stability in a controlled sequence context. This reporter corresponds to a stretch of 20 repetitions of a given codon (to be tested for its effect on mRNA stability), each repetition being separated by one codon corresponding to each of the 20 canonical amino acids. This stretch is inserted at the 3\' end of the coding sequence of a superfolder GFP flanked with fixed 5\' and 3\' untranslated regions. In vitro transcribed capped and polyadenylated RNAs are then produced from these reporters (each with a specific stretch of repetitions of a given codon), pooled together and injected into zebrafish zygotes to monitor their relative abundance at different time points upon injection.
1 Using the PACE reporter, the authors were able to obtain a quantitative estimation of the impact of 58 out of the 61 sense codons on mod

In [16]:
for i, p in enumerate(review_2_chunks[:200]):
    print(i, p)

0 In this manuscript, Mishima et al aim to determine if the RNA-mediated decay determined by codon optimality is part of the ribosome quality control pathway, triggered by slowed codon decoding and ribosome stalling or it is an independent pathway.
1 To this end, the authors capitalize on their previous work to design a very elegant high-throughput reporter system that can analyze individually codon usage, ribosome occupancy and tRNA abundance. This reporter system, called PACE, is rigorously validated throughout the manuscript, because blocking translation with a morpholino blocking the AUG codon demonstrated that the effects no RNA stability are translation dependent.
2 When most of the available codons are tested using the PACE system, the authors recapitulate codon optimality profiles similar to the ones previously uncovered using transcriptome-wide approaches.
3 Thanks to the design of the reporter, which alternates repeats of a test codon with random codons, the authors can calcu

In [37]:
for i, p in enumerate(preprint_chunks[:200]):
    print(i, p)

0 During the translation of mRNA to protein, a sequence of codons is decoded by tRNAs on the ribosome. While the translation elongation process is strictly controlled, the speed of ribosome movement passing through codons is not uniform. For example, each codon is decoded with variable efficiency due to the difference in tRNA availability (Hussmann et al., 2015; Varenne et al., 1984; Weinberg et al., 2016). Particular pairs of amino acids and codons inhibit peptide bond formation and/or decoding (Doerfel et al., 2013; Gamble et al., 2016; Schuller et al., 2017; Ude et al., 2013). The nascent polypeptide may interact with the ribosome exit tunnel and retard the ribosome (Ito and Chiba, 2013; Wilson et al., 2016). As a result, ribosomes traverse along the ORF with distinct kinetics unique to each coding sequence (Choi et al., 2018).
1 Nonuniform movement of the ribosome is not always harmful and is instead beneficial to the cell (Stein and Frydman, 2019). In bacteria, programmed ribosome