In [1]:
import json
import random
from tqdm import tqdm

import numpy as np
import pandas as pd

## 1. Load data

In [2]:
def load_data(filepath):
    d = {}
    with open(filepath) as f:
        for i, line in enumerate(f):
            d[i] = json.loads(line)
    return d

d = load_data("../misc/test_dataset.jsonl")
dataset = pd.DataFrame.from_dict(d, orient="index")
display(dataset.head())
print(f"Total number of rows: {len(dataset)}")

Unnamed: 0,citation_sentence,manuscript_id,cited_id,cited_text,manuscript_result_text
0,"In addition, we here report comparable changes...",8281087,15228934,ietary incorporation of plant sterols and stan...,The systematic search retrieved 1084 potentia...
1,"In addition, we here report comparable changes...",8281087,20547173,Elevated plasma total cholesterol (TC) 5 and L...,The systematic search retrieved 1084 potentia...
2,"In addition, we here report comparable changes...",8281087,15671550,"Phytosterols (PS), comprising both plant stero...",The systematic search retrieved 1084 potentia...
3,Such a discrepancy may likely be due to dose-o...,8281923,6704669,"3,4-Methylenedioxymethamphetamine (MDMA; ""ecst...",Two earlier studies have reported that male 5...
4,"Similar to earlier observations (42) , the inf...",11155963,4009171,Plasmodium falciparum is metabolically highly ...,"Cell membrane scrambling, a hallmark of erypt..."


Total number of rows: 100


In [3]:
i = 0
citation_sentence = dataset.loc[i, "citation_sentence"]
cited_text = dataset.loc[i, "cited_text"]
manuscript_result_text = dataset.loc[i, "manuscript_result_text"]

print(citation_sentence)

In addition, we here report comparable changes in serum cholesterol and lipoprotein concentrations as found in previously published meta-analyses [1] [2] [3] , implying that the included studies in this meta-analyses are representative of all available studies in the literature that have been performed with plant sterols and plant stanols.


## 2. Tokenize texts into sentences

In [4]:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters


# tokenizer should not split at abbreviations
punkt_params = PunktParameters()
punkt_params.abbrev_types = set(["i.e", "e.g", "etc", "al", "fig", "figs", 
                                 "ref", "refs", "p", "c", "s"]) 

# initialise sentence tokenizer
tokenizer = PunktSentenceTokenizer(punkt_params)

In [5]:
# tokenize manuscript_result_text, cited_text into sentences
query_sentences = tokenizer.tokenize(manuscript_result_text)
cited_sentences = tokenizer.tokenize(cited_text)

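# print the first 10 sentences of each text (slicing avoids an IndexError on short texts)
for sentence in query_sentences[:10]:
    print(sentence)
print()
for sentence in cited_sentences[:10]:
    print(sentence)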

 The systematic search retrieved 1084 potentially relevant papers, and after two selection rounds, 41 RCTs were included in the meta-analysis.
A flowchart of the study selection process is presented in Fig. 1 .
Of the 41 included studies (Online Supplemental Material Tables 1 and 2 ), 23 were conducted as a parallel study [15-17, 19, 21-26, 28-31, 34, 38-45] and 18 studies had a crossover design [13, 14, 18, 20, [46] [47] [48] [49] [50] [51] [52] [53] [54] [55] [56] [57] [58] [59]  The weighted effects of plant sterol or plant stanol consumption on plasma fat-soluble vitamin and carotenoid concentrations are presented in Table 1 .
Non-standardized and TC-standardized hydrocarbon carotenoid concentrations, i.e., lycopene, α-carotene and β-carotene, were significantly (P < 0.0001) lowered after consumption of plant sterol-or plant stanol-enriched foods.
β-Carotene For parallel studies, the weighted average baseline concentrations were calculated based on the baseline concentrations in th

## 3. Compute sentence embeddings

In [6]:
# import sent2vec

# model = sent2vec.Sent2vecModel()
# model.load_model('model.bin') 
# emb = model.embed_sentence("once upon a time .") 
# embs = model.embed_sentences(["first sentence .", "another sentence"])

- I was unable to download any of the pretrained .bin model files from https://github.com/epfml/sent2vec, so the sent2vec code above is left commented out.

In [7]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("stsb-distilbert-base")

In [8]:
# compute sentence embeddings
query_embeddings = model.encode(query_sentences)
cited_embeddings = model.encode(cited_sentences)

## 4. Compute sentence similarity scores

In [9]:
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(query_embeddings, cited_embeddings)
print(similarity.shape)

(23, 152)
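
For reference, the similarity matrix has one row per query sentence and one column per cited sentence. A minimal NumPy sketch of the same computation is given here; the function name and the toy vectors are made up for illustration, and the sketch assumes no embedding is an all-zero vector.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Return the (len(a), len(b)) matrix of cosine similarities between rows."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Scale every row to unit length; the dot product of unit vectors is the cosine.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Toy 2-d "embeddings" (made-up values): 2 query rows and 3 cited rows.
toy_query = np.array([[1.0, 0.0], [1.0, 1.0]])
toy_cited = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
toy_similarity = cosine_similarity_matrix(toy_query, toy_cited)
print(toy_similarity.shape)
```

Entry `[q, c]` is the score between query sentence `q` and cited sentence `c`, which is how the indices are read in the next cell.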


In [10]:
# collect query_sentence-cited_sentence pairs with similarity > 0.6,
# then sort them by decreasing similarity score
rows = []
indexes = np.argwhere(similarity > 0.6)
for ind in indexes:
    rows.append([ind, similarity[ind[0], ind[1]],
                 query_sentences[ind[0]], cited_sentences[ind[1]]])
# passing columns here keeps this working even when no pair passes the threshold
sentence_pairs = pd.DataFrame(rows, columns=["indices", "score", "query_sentence", "cited_sentence"])
sentence_pairs.sort_values(by=["score"], ascending=False, inplace=True)

print(citation_sentence)
for _, row in sentence_pairs.iterrows():
    print(f"\nscore: {row['score']:.5f}")
    print(f"query: {row['query_sentence']}")
    print(f"cited: {row['cited_sentence']}")

In addition, we here report comparable changes in serum cholesterol and lipoprotein concentrations as found in previously published meta-analyses [1] [2] [3] , implying that the included studies in this meta-analyses are representative of all available studies in the literature that have been performed with plant sterols and plant stanols.

score: 0.73832
query: Non-standardized and TC-standardized hydrocarbon carotenoid concentrations, i.e., lycopene, α-carotene and β-carotene, were significantly (P < 0.0001) lowered after consumption of plant sterol-or plant stanol-enriched foods.
cited: In conclusion, plant sterol/stanol containing products significantly reduced LDL concentrations but the reduction was related to individuals' baseline LDL levels, food carrier, frequency and time of intake.

score: 0.72639
query: HDL-C concentrations did not change after plant sterol or plant stanol consumption (0.2 %), while TAG concentrations were significantly (P < 0.0001) decreased by 0.06 mmol/L
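
The threshold-and-sort step is reused later for another example, so it could be wrapped in a small helper. The sketch below is one possible shape for it; `rank_sentence_pairs` is a hypothetical name, and it returns plain tuples rather than a DataFrame to keep the example short.

```python
import numpy as np

def rank_sentence_pairs(similarity, query_sentences, cited_sentences, threshold=0.6):
    """Return (score, query_sentence, cited_sentence) tuples above threshold, best first."""
    similarity = np.asarray(similarity)
    pairs = [
        (float(similarity[q, c]), query_sentences[q], cited_sentences[c])
        for q, c in np.argwhere(similarity > threshold)
    ]
    # Highest similarity first.
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs

# Toy similarity matrix (made-up values): 2 query sentences x 2 cited sentences.
toy_similarity = np.array([[0.9, 0.2], [0.65, 0.7]])
toy_pairs = rank_sentence_pairs(toy_similarity, ["q0", "q1"], ["c0", "c1"])
for score, query, cited in toy_pairs:
    print(f"{score:.2f}  {query}  {cited}")
```

When no pair passes the threshold the helper returns an empty list instead of raising.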

In [11]:
# get k=5 most similar cited_sentences
top_cited_sentences = sentence_pairs.cited_sentence[:5]
for s in top_cited_sentences: print(s)
    
top_cited_sentences = " ".join(top_cited_sentences)

In conclusion, plant sterol/stanol containing products significantly reduced LDL concentrations but the reduction was related to individuals' baseline LDL levels, food carrier, frequency and time of intake.
For instance, plant sterols/stanols consumed 2Á3 times/day reduced LDL cholesterol levels by 0.34 mmol/L (95% CI: (0.38, (0.18) while plant sterols/stanols consumed once per day in the morning did not result in a significant reduction in LDL levels.
Plat et al. (31) showed that 2.5 g of plant stanols in margarines and shortenings consumed for four weeks once per day at lunch or divided over three meals, lowered LDL cholesterol levels to a similar extent, about 10%.
The present meta-analysis has confirmed that baseline LDL cholesterol levels affect magnitude of reduction in LDL after plant sterol/stanol consumption which could explain the wide variation in responsiveness seen in previous studies.
Another (5) looked at the efficacy and safety of plant sterols/stanols as cholesterol lo
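
Because the ranking is over sentence pairs, the same cited sentence can appear more than once among the top rows when it matches several query sentences. One way to avoid repeats, sketched here under that assumption, is to score each cited sentence by its best match and take the top k of those scores. The helper name is hypothetical, and unlike the cell above it applies no 0.6 threshold.

```python
import numpy as np

def top_k_unique_cited(similarity, cited_sentences, k=5):
    """Pick the k cited sentences with the highest best-match score, without repeats."""
    similarity = np.asarray(similarity)
    # Best score each cited sentence achieves against any query sentence.
    best_per_cited = similarity.max(axis=0)
    # Stable sort on negated scores: descending order, ties keep their original order.
    order = np.argsort(-best_per_cited, kind="stable")[:k]
    return [cited_sentences[c] for c in order]

# Toy similarity matrix (made-up values): 2 query sentences x 3 cited sentences.
toy_similarity = np.array([[0.9, 0.1, 0.8], [0.85, 0.3, 0.2]])
toy_top = top_k_unique_cited(toy_similarity, ["c0", "c1", "c2"], k=2)
print(toy_top)
```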

## 5. (Abstractively) summarise the top few sentence pairs

In [12]:
import torch
from transformers import pipeline

# pipeline() expects a device index: 0 for the first GPU, -1 for CPU
device = 0 if torch.cuda.is_available() else -1

In [13]:
# load the default summarization pipeline (BART model trained on the CNN/Daily Mail news dataset)
summarizer = pipeline("summarization", device=device)

In [14]:
# perform summarization
summary_text = summarizer(top_cited_sentences, max_length=30, min_length=5, do_sample=False)[0]["summary_text"]

print(citation_sentence)
print()
print(summary_text)

In addition, we here report comparable changes in serum cholesterol and lipoprotein concentrations as found in previously published meta-analyses [1] [2] [3] , implying that the included studies in this meta-analyses are representative of all available studies in the literature that have been performed with plant sterols and plant stanols.

 Plant sterols/stanols consumed 2Á3 times/day reduced LDL cholesterol levels by 0.34 mmol/L (95


- The summary is quite similar to the gold citation sentence, but more specific.
    - This is possibly because the gold citation sentence cites three papers, whereas the summary draws on only one of them.
    - Idea: take the most similar sentences from all cited papers and summarise them together.
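
The idea in the last bullet could be sketched as follows. In the dataset preview, rows 0 to 2 share the same citation sentence but have different `cited_id` values, so their scored sentences can be pooled before summarising. The function name, the dictionary layout and the toy scores below are assumptions for illustration only.

```python
import heapq

def pool_top_sentences(scored_by_paper, k=5):
    """Merge (score, sentence) lists from several cited papers and keep the k best overall."""
    pooled = [
        (score, cited_id, sentence)
        for cited_id, scored in scored_by_paper.items()
        for score, sentence in scored
    ]
    # nlargest returns the items in descending order of score.
    return heapq.nlargest(k, pooled, key=lambda item: item[0])

# Toy input (made-up scores): one list of (score, sentence) per cited paper.
toy_scores = {
    "paper_a": [(0.74, "a1"), (0.61, "a2")],
    "paper_b": [(0.70, "b1")],
    "paper_c": [(0.66, "c1"), (0.72, "c2")],
}
pooled = pool_top_sentences(toy_scores, k=3)
text_to_summarise = " ".join(sentence for _, _, sentence in pooled)
print(text_to_summarise)
```

A plain global top-k can still be dominated by one paper; taking a fixed number of sentences per paper would be an alternative if every cited paper should be represented.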

## 6. Trial with another sentence

In [15]:
i = 3
citation_sentence = dataset.loc[i, "citation_sentence"]
cited_text = dataset.loc[i, "cited_text"]
manuscript_result_text = dataset.loc[i, "manuscript_result_text"]

print(citation_sentence)

Such a discrepancy may likely be due to dose-or species-specific differences since an acute administration of 3.3 mg/kg MDMA did not stimulate locomotion in mice (Scearce-Levie et al, 1999) , but did in rats ( Bankson and Cunningham, 2002) .


In [16]:
# tokenize manuscript_result_text, cited_text into sentences
query_sentences = tokenizer.tokenize(manuscript_result_text)
cited_sentences = tokenizer.tokenize(cited_text)

# compute sentence embeddings
query_embeddings = model.encode(query_sentences)
cited_embeddings = model.encode(cited_sentences)

# compute similarity matrix
similarity = cosine_similarity(query_embeddings, cited_embeddings)

# collect query_sentence-cited_sentence pairs with similarity > 0.6,
# then sort them by decreasing similarity score
rows = []
indexes = np.argwhere(similarity > 0.6)
for ind in indexes:
    rows.append([ind, similarity[ind[0], ind[1]],
                 query_sentences[ind[0]], cited_sentences[ind[1]]])
# passing columns here keeps this working even when no pair passes the threshold
sentence_pairs = pd.DataFrame(rows, columns=["indices", "score", "query_sentence", "cited_sentence"])
sentence_pairs.sort_values(by=["score"], ascending=False, inplace=True)

for _, row in sentence_pairs.iterrows():
    print(f"\nscore: {row['score']:.5f}")
    print(f"query: {row['query_sentence']}")
    print(f"cited: {row['cited_sentence']}")


score: 0.73420
query: Thus, 5-HT 1B receptor agonists produced a more sustained hypophagia than does MDMA (Figure 1b) .
cited: This suggests that 5-HT 2C R stimulation can mask 5-HT 1B/1D R-mediated hyperactivity.

score: 0.72737
query: Pharmacological studies have reported that the 5-HT 2C receptor antagonist SB242084 reduced the anorectic effects of both m-chlorophenylpiperazine, a mixed 5-HT 2A,2B,2C receptor agonist (Kennett et al, 1997) , and fenfluramine (Vickers et al, 2001 ).
cited: More recently, the selective 5-HT 1B/1D R antagonist GR 127935 has been shown to block ( ϩ )-MDMA-induced hyperactivity in rats to the level of saline controls .

score: 0.71703
query: Treatments that differ significantly from saline in wild-type mice are marked ( Since both MDMA and 5-HT 1B receptor agonists reduced food intake in starved mice, we thought that 5-HT 1B receptors could be involved in MDMA-delayed eating.
cited: More recently, the selective 5-HT 1B/1D R antagonist GR 127935 has been 

In [17]:
# get k=5 most similar cited_sentences
top_cited_sentences = " ".join(sentence_pairs.cited_sentence[:5])

# perform summarization
summary_text = summarizer(top_cited_sentences, max_length=50, min_length=5, do_sample=False)[0]["summary_text"]

print(citation_sentence)
print()
print(summary_text)

Such a discrepancy may likely be due to dose-or species-specific differences since an acute administration of 3.3 mg/kg MDMA did not stimulate locomotion in mice (Scearce-Levie et al, 1999) , but did in rats ( Bankson and Cunningham, 2002) .

 The selective 5-HT 1B/1D R antagonist GR 127935 has been shown to block ( ϩ )-MDMA-induced hyperactivity in rats to the level of saline controls . The inability of SB 206553


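- summary_text ends abruptly, most likely because max_length=50 caps the number of generated tokens and cuts the output mid-sentence.
- The gold citation sentence summarises the results of two cited papers (one on mice, one on rats), but summary_text draws only on the paper on rats.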
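
One simple workaround for the abrupt ending, sketched here as a post-processing step rather than a change to the summarizer, is to drop any trailing fragment that does not end in sentence-final punctuation. The helper name is hypothetical, and the rule is deliberately naive: an abbreviation such as "et al." followed by a space would also count as a sentence end.

```python
import re

def trim_to_last_sentence(text):
    """Drop a trailing fragment that does not end in '.', '!' or '?'."""
    text = text.strip()
    # Positions just after each '.', '!' or '?' that is followed by whitespace or end of text.
    ends = [m.end() for m in re.finditer(r"[.!?](?=\s|$)", text)]
    # If no sentence end is found, return the text unchanged rather than an empty string.
    return text[: ends[-1]] if ends else text

trimmed = trim_to_last_sentence(" First finding is complete. Second finding was cut")
print(trimmed)
```

Decimal points such as the one in "0.34" are not treated as sentence ends because they are not followed by whitespace.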