# Extended solution

*Papers that somewhat resemble my ideas or may just be useful*:  
+ [Neural Extractive Text Summarization with Syntactic Compression](https://aclanthology.org/D19-1324.pdf) *by Jiacheng Xu and Greg Durrett*:
    + The approach encodes the text and *then* performs the compression;
    + https://github.com/jiacheng-xu/neu-compression-sum (gosh, what horrible code)
+ [Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting](https://aclanthology.org/P18-1063.pdf) *by Yen-Chun Chen and Mohit Bansal*:
    + This approach uses RL to select salient sentences, which are then rewritten abstractively;
    + https://github.com/ChenRocks/fast_abs_rl (much more readable repo)
+ [Extractive Summarization as Text Matching](https://arxiv.org/pdf/2004.08795v1.pdf) *by Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, Xuanjing Huang*:
    + https://github.com/maszhongming/MatchSum
    + Pushes the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).

### Work structure
#### 1. Dataset loading and preprocessing

#### 2. Implementing heuristics as standalone preprocessing functionality (postponed)
This work suggests using some preprocessing tricks to improve the actual effect of summarization.
The following approaches are suggested (each is to be implemented in a separate notebook):
1. Use coreference resolution across sentences so we don't miss important nouns in our summary.
2. Try to split compound sentences into several smaller ones.
3. Compress the resulting sentences by dropping low-information words without (hopefully) sacrificing readability.
    - Named entities intuitively seem more important than common nouns, so they are not to be deleted.
    
#### 3. Implementation of the summarization block per se
#### 4. Evaluation and metrics


## 1. Dataset

From here: https://cs.nyu.edu/~kcho/DMQA/

## 2. Heuristics (postponed)

See notebooks 3.1, 3.2, 3.3

## 3. Summarization block

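For my summarization block I'd like to make use of the novel method suggested by the MatchSum paper ([Extractive Summarization as Text Matching](https://arxiv.org/pdf/2004.08795v1.pdf)) listed above.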

The key point of the work:
> Instead of scoring and extracting sentences
> one by one to form a summary, we formulate
> extractive summarization as a semantic text matching
> problem and propose a novel summary-level
> framework. Our approach bypasses the difficulty
> of summary-level optimization by contrastive learning,
> that is, a good summary should be more
> semantically similar to the source document than the
> unqualified summaries.


> Inspired by siamese network structure (Bromley
> et al., 1994), we construct a Siamese-BERT architecture
> to match the document D and the candidate
> summary C. Our Siamese-BERT consists of two
> BERTs with tied-weights and a cosine-similarity
> layer during the inference phase.

Okay, we will need some BERT...
> we use the vector of the ‘\[CLS]’ token from the top BERT layer as the representation of a document or summary.


We also compare similarities...
> Let $r_D$ and $r_C$ denote the embeddings of the document D and candidate summary C. Their similarity score is measured by
$f(D,C) = cosine(r_D,r_C)$.

And the loss...
> In order to fine-tune Siamese-BERT, we use a margin-based triplet loss to update the weights
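
To make the similarity score and the margin idea concrete, here is a minimal pure-Python sketch. It is not the paper's exact objective: `margin_loss` is a generic margin-based triplet loss, the function names and the margin value are my own assumptions, and toy 2-d vectors stand in for the '[CLS]' embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def margin_loss(doc, positive, negative, margin=0.01):
    """Generic triplet hinge loss (assumed form, margin value is arbitrary).

    The loss is zero once the positive candidate is closer to the document
    than the negative one by at least `margin`.
    """
    return max(0.0, cosine(doc, negative) - cosine(doc, positive) + margin)


# Toy embeddings standing in for r_D and two candidate summaries.
r_D = [1.0, 0.0]
r_good = [1.0, 0.0]  # points the same way as the document
r_bad = [0.0, 1.0]   # orthogonal to the document

print(margin_loss(r_D, r_good, r_bad))  # correctly ordered pair, no penalty
print(margin_loss(r_D, r_bad, r_good))  # wrongly ordered pair is penalized
```

In a real setup the same computation would presumably run on batched tensors inside the training loop.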

But doesn't searching through all possible candidates mean $\sum_{i=1}^{n}C_n^i = 2^n - 1$ variants?
> In the inference phase, we formulate extractive summarization as a task to search for the best summary among all the candidates C extracted from the document D.

... and yeah, exactly:
> The matching idea is more intuitive while it suffers from combinatorial explosion problems. \[...] we introduce a content selection module to pre-select salient sentences.

The content selection mentioned above is done with the [PreSumm](https://github.com/nlpyang/PreSumm) model.
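
A quick sanity check of how bad the combinatorial explosion gets without pre-selection. This is only an illustrative sketch: `candidate_count` is a hypothetical helper, not something from the paper or the MatchSum repo.

```python
from math import comb


def candidate_count(n, sizes=None):
    """Number of sentence subsets of the given sizes out of n sentences.

    With sizes=None every non-empty subset is counted, i.e. 2**n - 1.
    """
    sizes = range(1, n + 1) if sizes is None else sizes
    return sum(comb(n, k) for k in sizes)


for n in (10, 20, 30):
    print(n, candidate_count(n))
```

Even a 30-sentence document gives over a billion candidates, which is why some content selection step is needed before matching.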

### May be useful:
+ "We truncate each document to 512 tokens and feed them to MatchSum because the pre-trained models (BERT, RoBERTa) has a maximum length limit." [github](https://github.com/maszhongming/MatchSum/issues/9#issuecomment-637904607)

## 3.0 Suggested flow
The paper suggests the following workflow:
1. You score the sentences of the input text with some third-party model according to their presumed contribution to the meaning of the text.
2. You build summary candidates as combinations of the top-scored sentences from step 1 (this is hyperparameter-dependent, so you can control the number of output sentences).
3. The deep learning model learns to choose the best of these candidate summaries, while avoiding common pitfalls of the usual models (the authors describe the "pearl-summary vs. best-summary" problem).

### \[Preparation] Creating dataset

The dataset I will use needs to be in a certain form. The original solution uses the JSONL format: one JSON object per line.
I don't mind it. 

So, first of all, I preprocess and convert my dataset into a suitable format.


No code here for now, refer to `dataset_utils/cnndm_preprocessor`

I also truncate the dataset to the first 10k docs. The original paper states that training took 30 hours on several top GPUs. I have none.
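
For reference, a minimal sketch of what writing JSONL and reading back only the first N records could look like. This is not the code from `dataset_utils/cnndm_preprocessor`; the helper names and the `text` / `summary` field names are assumptions made for illustration.

```python
import json
from itertools import islice
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path, limit=None):
    """Read at most `limit` records (all of them if limit is None)."""
    with Path(path).open(encoding="utf-8") as f:
        non_empty = (line for line in f if line.strip())
        return [json.loads(line) for line in islice(non_empty, limit)]


# Hypothetical record layout: a list of sentences plus a reference summary.
example = {"text": ["First sentence.", "Second sentence."],
           "summary": ["A short summary."]}
```

Calling `read_jsonl(path, limit=10_000)` would then give the truncated dataset without loading the rest of the file.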

### Scoring step

The paper suggests scoring the sentences by their importance. I will use a trivial method for that, and I will not truncate my input at this stage, since the cut-off can be a hyperparameter.


 > - truncate each document into the 5 most important sentences (using BertExt), 
   then select any 2 or 3 sentences to form a candidate summary, so there are C(5,2)+C(5,3)=20 candidate summaries.
   if you want to process other datasets, you may need to adjust these numbers according to specific situation.
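
The candidate construction described in the quote can be sketched with `itertools.combinations`. The function name and defaults are mine, and restoring document order for the selected sentences is an assumption rather than something the quote states.

```python
from itertools import combinations


def build_candidates(ranked_ids, top_k=5, sizes=(2, 3)):
    """Build candidate summaries as tuples of sentence indices.

    ranked_ids: sentence indices sorted from most to least important.
    """
    # Keep the top_k sentences and put them back in document order (assumption).
    top = sorted(ranked_ids[:top_k])
    return [combo for size in sizes for combo in combinations(top, size)]


ranked = [7, 0, 3, 12, 5, 1, 9]  # toy ranking of sentence indices
candidates = build_candidates(ranked)
print(len(candidates))  # C(5,2) + C(5,3) = 20
```

Both `top_k` and `sizes` are the numbers one would adjust for other datasets.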

BertExt has a rather questionable codebase and maintenance, so I will stick with centroid-based sorting for this case.
Moreover, I will use neither BERT nor RoBERTa but LaBSE, simply because I've already used it.
Thus, if my integrated encoder is LaBSE anyway, why shouldn't I just use it to rank the sentences as well?
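
Roughly, centroid-based ranking could look like the sketch below. It is not the code from `dataset_utils/create_sentence_ranking`: the function names are hypothetical, and plain toy vectors stand in for the LaBSE sentence embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_centroid(embeddings):
    """Return sentence indices sorted by similarity to the mean embedding."""
    dim = len(embeddings[0])
    centroid = [sum(vec[i] for vec in embeddings) / len(embeddings)
                for i in range(dim)]
    scores = [cosine(vec, centroid) for vec in embeddings]
    return sorted(range(len(embeddings)), key=lambda i: scores[i], reverse=True)


# Toy 2-d "sentence embeddings": sentence 1 lies closest to the centroid.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(rank_by_centroid(toy))
```

The resulting index order is what the candidate construction step would consume as its ranking.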

No code here for now, refer to `dataset_utils/create_sentence_ranking`

### Dataset loader
The dataset loader was rewritten, inspired by the MatchSum repo, but with much more readable variable names and with a torch Dataset instead of any reliance on the fastNLP (wtf?) library.

In [None]:
from pathlib import Path
from dataset_utils.dataset import CNNDMDataset

textual_file_path = Path("../data/cnndm/dataset10k.jsonl")
indices_file_path = Path("../data/cnndm/sent_id10k.jsonl")


### Creating matcher

In [None]:
# Callbacks
class

## 4. Eval and metrics