<a href="https://colab.research.google.com/github/ris27hav/ACA-Wikipedia-Simplifier/blob/main/Final%20model/Team_4/train_muss.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wikipedia Simplifier

This is a simplified (xD) implementation of the MUSS model presented in the paper [MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases](https://arxiv.org/pdf/2005.00352.pdf) using *Pytorch* and *transformers* library. I have implemented it for English language only. This is not the exact implementation as described in the paper but similar to it. 

Since it is nearly impossible to train this model using Colab or using just 1 GPU and limited storage, so I have used the pretrained model for evaluations. Though the functions can be tested invidually to know if they work the way they are intended to do. Please check the other notebook to see the model in action.

###Approach

####**1. Mining Paraphrases**

>**a) Sequence Extraction**:
>A sequence consists of multiple sentences: to allow sentence splitting or fusion operations.

* get docs from HEAD split in CCNet (categorised by language)
* doc -> sentences using NLTK sentence tokenizer
* adjacent sentences -> sequences with max character = 300
* filter sequences: remove those having >= 10% puctutation characters or low language model probability according to a 3-gram Kneser-Ney language model trained with kenlm on Wikipedia


>**b) Creating a Sequence Index using Embeddings**

* extracted sequence (1 billion) -> 1024-dimensional embeddings using LASER (reduced to 512 by using PCA followed by random rotation)


>**c) Mining paraphrases**

* Index these embeddings (for use with faiss)
* each sequence is used as a query (q_i) against these 1 billion sequences to find top 8 nearest neighbour using L2 distance (faiss), keep those with L2 distance < 0.05 and relative distance with other 7 neighbours < 0.6
* paraphrase filtering -<br>
    - remove almost identical pp with character Levenshtein distance <= 20%
    - remove pp coming from same document
    - remove pp where one sqeuence is contained in other


####**2. Simplifying with ACCESS**

* ACCESS is a method to make any seq2seq model controllable by conditioning on simplification-specific control tokens.
* Apply this to seq2seq pretrained transformer models based on the BART.

>**Training with control tokens**

- During train time, control tokens provided to model that give info about target sequence.
- During inference time, control the generation by selecting a given target control value.
- Prepend the following control tokens to every source in training set:
		<NumChars_XX%> : Character Length Ratio
		<LevSim_YY%> : replace-only Levenshtein similarity
		<WordFreq_ZZ%> : aggregated word frequency ratio
		<DepTreeDepth_TT%> : dependency tree depth ratio

>**Selecting Control values at Inference**

- Shorter sentences are more adapted to people with cognitive disabilities, while using more frequent words are useful to second language learners.
- Choose these hyperparameters based on SARI score on validation set or by using prior knowledge based on target audience.


####**3. Leveraging Unsupervised Pretraining**

* Fine tune the pretrained generative model BART on the newly created training corpora.

---

### Loading the data for mining paraphrases

**Get data from CCNet**

In [2]:
!git clone https://github.com/facebookresearch/cc_net

Cloning into 'cc_net'...
remote: Enumerating objects: 471, done.[K
remote: Total 471 (delta 0), reused 0 (delta 0), pack-reused 471[K
Receiving objects: 100% (471/471), 169.97 KiB | 4.59 MiB/s, done.
Resolving deltas: 100% (329/329), done.


In [6]:
%cd cc_net/

/content/cc_net


In [7]:
!mkdir ./data/

In [None]:
!python -m pip install .[getpy]

In [None]:
# !python -m cc_net --dump 2019-13

# Note : this won't work here because it requires 7 TB storage :)

**Get sample data from wikipedia**

In [1]:
# For testing purpose, we can provide some sample data
%pip install wikipedia

Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... [?25l[?25hdone
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11696 sha256=6b368574f746b01ed9852ed3179de0aabed882e83b71ba7dd02d8d0818eb250e
  Stored in directory: /root/.cache/pip/wheels/15/93/6d/5b2c68b8a64c7a7a04947b4ed6d89fb557dcc6bc27d1d7f3ba
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [1]:
import wikipedia

In [2]:
# upload simple_wiki.txt file

def get_data():
    titles = ["Messi", "Christiano Ronaldo", 'messi-ronaldo rivalry', 'chernobyl', 'artificial intelligence', 'Hinduism']
    docs = []
    with open('simple_wiki.txt') as fo:
        data = fo.read()
        docs.append(data)
    for title in titles:
        page = wikipedia.page(title)
        docs.append(page.content)
    return docs

### Mining Paraphrases

In [27]:
# Install required dependencies
%pip install laserembeddings python-Levenshtein faiss faiss-cpu



In [28]:
# Import required libraries
import numpy as np
import faiss

from string import punctuation
from nltk.tokenize import sent_tokenize
from laserembeddings import Laser
from sklearn.decomposition import PCA
from Levenshtein import distance as levenshtein_distance

import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

**Sequence Extraction**

In [29]:
def doc_to_sentences(doc):
    """
    Splits a document into sentences.
    """
    doc = doc.replace('\n', ' ').replace('\t', ' ').replace('\x00', ' ')
    return sent_tokenize(doc)


def filter_sentences(sentences, punc_ratio=0.1, lang_model=None, lang_prob=0.5):
    """
    Filters sentences by removing those which contain a high number of punctuation marks 
    or have low language model probability or are too small.
    """
    filtered_sentences = []

    for seq in sentences:
        if len(seq) < 30:
            continue

        if lang_model is not None:
            prob = lang_model.prob(seq)
            if prob < lang_prob:
                continue

        num_punc = sum(1 for c in seq if c in punctuation)
        if num_punc / len(seq) < punc_ratio:
            filtered_sentences.append(seq)

    return filtered_sentences


def generate_sequences(sentences, max_chars=300):
    """
    Generates sequences of adjacent sentences from a list of sentences.
    """
    sequences = []
    total_sentences = len(sentences)

    for i in range(total_sentences):
        cur_seq = sentences[i]
        cur_chars = len(cur_seq)
        if cur_chars > max_chars:
            continue
        
        sequences.append(cur_seq)
        for j in range(i+1, total_sentences):
            cur_chars += len(sentences[j])
            if cur_chars > max_chars:
                break
            
            cur_seq += ' ' + sentences[j]
            sequences.append(cur_seq)

    return sequences

**Creating a sequence index using embeddings**

In [30]:
!python -m laserembeddings download-models

Downloading models into /usr/local/lib/python3.7/dist-packages/laserembeddings/data

✅   Downloaded https://dl.fbaipublicfiles.com/laser/models/93langs.fcodes    
✅   Downloaded https://dl.fbaipublicfiles.com/laser/models/93langs.fvocab    
✅   Downloaded https://dl.fbaipublicfiles.com/laser/models/bilstm.93langs.2018-12-26.pt    

✨ You're all set!


In [31]:
def compute_embeddings(sequences, dim=512):
    """
    Computes the embeddings for a list of sequences.
    """
    laser = Laser()
    embeddings = laser.embed_sentences(sequences, lang='en')
    # embeddings is a N*1024 (N = number of sentences) NumPy array

    pca = PCA(n_components=dim)
    embeddings = pca.fit_transform(embeddings)
    return embeddings

**Mining Paraphrases**

In [32]:
def index_embeddings(embeddings):
    """
    Indexes a list of embeddings using a FAISS index.
    """
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)
    return index


def get_nearest_neighbors(index, embeddings, k=8):
    """
    Returns the k nearest neighbors of each embedding in a list of embeddings.
    """
    D, I = index.search(embeddings, k)
    return D, I


def filter_nearest_neighbors(D, I, max_L2_dist=0.05):
    """
    Filters the nearest neighbors to remove those which are too far from the queries.
    """
    filtered_neighbors = np.ones(I.shape) * (-1)
    for i in range(I.shape[0]):
        for j in range(I.shape[1]):
            if D[i,j] <= max_L2_dist:
                filtered_neighbors[i,j] = I[i,j]
    
    filtered_neighbors = filtered_neighbors.astype(int)
    return filtered_neighbors


def filter_paraphrases(I, sequences, min_l_dist=0.2):
    """
    Removes almost identical pp with character level Levenshtein distance <= 20%
	or pp from coming same document         ** (need to implement this) **
	or pp where one sequence is contained in other
    """
    for i in range(I.shape[0]):
        cur_seq = sequences[i]
        for j in range(I.shape[1]):
            if I[i,j] == -1:
                continue
            
            target_seq = sequences[I[i,j]]
            dist = levenshtein_distance(cur_seq, target_seq)
            if dist <= min_l_dist:
                I[i,j] = -1
                continue
            
            if cur_seq in target_seq or target_seq in cur_seq:
                I[i,j] = -1

    return I


def generate_aligned_paraphrases(I, sequences):
    """
    Generates a list of paraphrases from the list of sequences and their nearest neighbors.
    """
    paraphrases = []
    for i in range(I.shape[0]):
        cur_seq = sequences[i]
        for j in range(I.shape[1]):
            if I[i,j] == -1:
                continue
            
            target_seq = sequences[I[i,j]]
            paraphrases.append((cur_seq, target_seq))
    
    return paraphrases

---

### Simplifying with ACCESS

In [9]:
!pip install python-Levenshtein



In [10]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz

--2021-07-31 19:06:58--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 104.22.75.142, 172.67.9.4, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1325960915 (1.2G) [binary/octet-stream]
Saving to: ‘cc.en.300.vec.gz.1’


2021-07-31 19:07:52 (23.8 MB/s) - ‘cc.en.300.vec.gz.1’ saved [1325960915/1325960915]



In [11]:
!python -m spacy download en_core_web_md

Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4 MB)
[K     |████████████████████████████████| 96.4 MB 1.3 MB/s 
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_md')


In [12]:
import Levenshtein
import spacy
import numpy as np
import gzip
import en_core_web_md

**Character length Ratio**

In [13]:
def get_character_length_ratio(original_seq, target_seq):
    """
    Return the ratio (in %) of the length of the target sequence
    to the length of the original sequence.
    """
    return (len(target_seq) / len(original_seq)) * 100

**Replace-Only Levenshtein Similarity**

In [14]:
def get_replace_only_levenshtein_similarity(original_seq, target_seq):
    """
    Return the ratio (in %) of the Levenshtein distance between the target 
    sequence and the original sequence, where only replacements are considered.
    """
    distance = len(
        [
            _
            for operation, _, _ in Levenshtein.editops(original_seq, target_seq)
            if operation == "replace"
        ]
    )
    max_replace_only_distance = min(len(original_seq), len(target_seq))
    if max_replace_only_distance == 0:
        return 0
    return (1 - (distance / max_replace_only_distance)) * 100

**Word Frequency Ratio**

In [48]:
def yield_lines():
    with gzip.open("./cc.en.300.vec.gz", "rt") as f:
        for line in f:
            yield line.rstrip('\n')

def get_word2rank(vocab_size=60000):
    word2rank = {}
    line_generator = yield_lines()
    next(line_generator)
    for i, line in enumerate(line_generator):
        if (i + 1) > vocab_size:
            break
        word = line.split(" ")[0]
        word2rank[word] = i
    return word2rank

def is_content_token(token):
    return not token.is_stop and not token.is_punct and token.ent_type_ == ''  # Not named entity

def get_content_words(text, spacy_model):
    spacy_tokenizer = spacy_model.Defaults.create_tokenizer(spacy_model)
    spacy_content_tokens = [token for token in spacy_tokenizer(text) if is_content_token(token)]
    return [token.text for token in spacy_content_tokens]

def get_log_ranks(text, spacy_model, word2rank):
    return [
        np.log(1 + word2rank.get(word, len(word2rank)))
        for word in get_content_words(text, spacy_model)
        if word in word2rank
    ]

def get_word_rank_ratio(original_seq, target_seq, spacy_model, word2rank):
    """
    Return the ratio (in %) of the word rank of the target sequence
    to the word rank of the original sequence.
    """    
    orig_log_ranks = get_log_ranks(original_seq, spacy_model, word2rank)
    target_log_ranks = get_log_ranks(target_seq, spacy_model, word2rank)
    if len(orig_log_ranks) == 0:
        orig_log_ranks = [np.log(1 + len(word2rank))]
    if len(target_log_ranks) == 0:
        target_log_ranks = [np.log(1 + len(word2rank))]
    
    orig_log_rank = np.quantile(orig_log_ranks, 0.75)
    target_log_rank = np.quantile(target_log_ranks, 0.75)
    
    return (target_log_rank / orig_log_rank) * 100

**Dependency Tree Depth Ratio**

In [44]:
def get_subtree_depth(node):
    if len(list(node.children)) == 0:
        return 0
    return 1 + max([get_subtree_depth(child) for child in node.children])


def get_dependency_tree_depth_ratio(original_seq, target_seq, model):
    """
    Return the ratio (in %) of the depth of the dependency tree of the target 
    sequence to the depth of the dependency tree of the original sequence.
    """
    original_tree_depths = [
        get_subtree_depth(spacy_sentence.root)
        for spacy_sentence in model(str(original_seq)).sents
    ]
    target_tree_depths = [
        get_subtree_depth(spacy_sentence.root)
        for spacy_sentence in model(str(target_seq)).sents
    ]
    original_tree_depth = 0 if len(original_tree_depths) == 0 else max(original_tree_depths)
    target_tree_depth = 0 if len(target_tree_depths) == 0 else max(target_tree_depths)

    return 0 if original_tree_depth == 0 else (target_tree_depth / original_tree_depth) * 100

**Prepend the paraphrases with Control Tokens**

In [52]:
def prepend_control_tokens(paraphrases):
    """
    Return the list of paraphrases where each original sequence is prepended
    by the control tokens.
    """
    spacy_model = en_core_web_md.load()
    word2rank = get_word2rank(vocab_size=60000)
    final_pps = []
    for orig_seq, target_seq in paraphrases:
        tokens = []
        tokens.append(get_character_length_ratio(orig_seq, target_seq))
        tokens.append(get_replace_only_levenshtein_similarity(orig_seq, target_seq))
        tokens.append(get_word_rank_ratio(orig_seq, target_seq, spacy_model, word2rank))
        tokens.append(get_dependency_tree_depth_ratio(orig_seq, target_seq, spacy_model))
        
        # Round the ratios in a fixed interval of 0.05 (5%) and capped to 
        # a maximum ratio of 2 (200%)
        mod_tokens = []
        for token in tokens:
            token = round(token / 5) * 5
            token = min(max(5, token), 200)
            mod_tokens.append(token)

        CTRL_TOKEN = "<NbChars_{:.0f}%> <LevSim_{:.0f}%> <WordFreq_{:.0f}%> <DepTreeDepth_{:.0f}%> ".format(
            mod_tokens[0], mod_tokens[1], mod_tokens[2], mod_tokens[3]
        )
        orig_seq = CTRL_TOKEN + orig_seq
        final_pps.append((orig_seq, target_seq))
        
    return final_pps


def prepend_control_tokens_for_inference(sentence, tokens):
    """
    Return the sentence encoded with the control tokens for inference
    """
    CTRL_TOKEN = "<NbChars_{:.0f}%> <LevSim_{:.0f}%> <WordFreq_{:.0f}%> <DepTreeDepth_{:.0f}%> ".format(
        tokens[0], tokens[1], tokens[2], tokens[3]
    )
    return CTRL_TOKEN + sentence

---

### Leveraging Unsupervised Pretraining

In [18]:
!pip install transformers



In [19]:
import torch

from torch.utils.data import DataLoader, Dataset
from transformers import BartTokenizerFast, BartForConditionalGeneration
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

In [20]:
class PPDataset(Dataset):
    """
    Custom dataset class for paraphrase generation
    """
    def __init__(self, data, tokenizer):
        self.encodings = {'input_ids': [], 'labels': []}
        for pp in data:
            source, target = pp
            source = tokenizer.encode(source)
            with tokenizer.as_target_tokenizer():
                target = tokenizer.encode(target)
            self.encodings['input_ids'].append(source)
            self.encodings['labels'].append(target)
    
    def __len__(self):
        return len(self.encodings['input_ids'])
    
    def __getitem__(self, index):
        item = {key: val[index] for key, val in self.encodings.items()}
        return item

In [60]:
def get_tokenizer():
    """
    Return the pretrained BART tokenizer and add the control tokens to it
    """
    tokenizer = BartTokenizerFast.from_pretrained('facebook/bart-large')
    control_tokens = []
    for token in ['NbChars', 'LevSim', 'WordFreq', 'DepTreeDepth']:
        for i in range(5, 201, 5):
            control_tokens.append(f'<{token}_{i}%>')
    tokenizer.add_tokens(control_tokens)

    return tokenizer


def get_dataset(paraphrases, tokenizer):
    """
    Create a dataset from the paraphrases
    """
    dataset = PPDataset(paraphrases, tokenizer)
    return dataset


def get_model(vocab_size):
    """
    Return the pretrained BART model and add fix the token embeddings matrix
    """
    model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')
    model.resize_token_embeddings(vocab_size)
    return model


def get_training_arguments(epochs=10, batch_size=8):
    """
    Rturn the training arguments
    """
    args = Seq2SeqTrainingArguments(
        output_dir = 'outputs',
        learning_rate = 3e-5,
        per_device_train_batch_size = batch_size,
        weight_decay = 0.01,
        num_train_epochs = epochs,
        predict_with_generate = True
    )
    return args


def get_data_collator(tokenizer, model):
    """
    Return the data collator for seq2seq model
    """
    data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
    return data_collator


def get_trainer(model, tokenizer, dataset, data_collator, training_arguments):
    """
    Return the trainer for fine-tuning the pretrained BART model
    """
    trainer = Seq2SeqTrainer(
        model = model,
        data_collator = data_collator,
        args = training_arguments,
        train_dataset = dataset,
        tokenizer = tokenizer,
    )
    return trainer


def simplify(sentence, tokenizer, model):
    """
    Return the simplified sentence
    """
    tokenized_sentence = tokenizer.encode(sentence, return_tensors='pt')
    output = model.generate(tokenized_sentence, num_beams=5)
    output = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in output]
    return output


def decode_sentence(encoded_sentence, tokenizer):
    """
    Decode the encoded sentence
    """
    decoded_sentence = tokenizer.decode(encoded_sentence, skip_special_tokens=True)
    return decoded_sentence

---

### Training in action

**Mine Paraphrases**

In [33]:
data = get_data()

In [34]:
sentences = []
for doc in data:
    sentences += doc_to_sentences(doc)

sentences = filter_sentences(sentences, punc_ratio=0.1, lang_model=None, lang_prob=0.5)
sentences.append('This paragraph is tough to comprehend.')
sentences.append('This paragraph is very hard to understand.')
sequences = generate_sequences(sentences, max_chars=300)

In [36]:
embeddings = np.ascontiguousarray(compute_embeddings(sequences, dim=512))

In [37]:
emb_index = index_embeddings(embeddings)

In [38]:
D, I = get_nearest_neighbors(emb_index, embeddings, k=8)
filtered_I = filter_nearest_neighbors(D, I, max_L2_dist=0.05)
filtered_I = filter_paraphrases(filtered_I, sequences, min_l_dist=0.2)
paraphrases = generate_aligned_paraphrases(filtered_I, sequences)

In [67]:
print(len(final_pps))
for source, target in final_pps:
    print('Simple:', source)
    print('Complex:', target, '\n')

72
Simple: <NbChars_105%> <LevSim_100%> <WordFreq_100%> <DepTreeDepth_100%> The term Hindu was later used in some Sanskrit texts such as the later Rajataranginis of Kashmir (Hinduka, c. 1450) and some 16th- to 18th-century Bengali Gaudiya Vaishnava texts including Chaitanya Charitamrita and Chaitanya Bhagavata.
Complex: The term Hindu was later used occasionally in some Sanskrit texts such as the later Rajataranginis of Kashmir (Hinduka, c. 1450) and some 16th- to 18th-century Bengali Gaudiya Vaishnava texts, including Chaitanya Charitamrita and Chaitanya Bhagavata. 

Simple: <NbChars_100%> <LevSim_95%> <WordFreq_100%> <DepTreeDepth_120%> These texts used to distinguish Hindus from Muslims who are called Yavanas (foreigners) or Mlecchas (barbarians), with the 16th-century Chaitanya Charitamrita text and the 17th century Bhakta Mala text using the phrase "Hindu dharma".
Complex: These texts used it to contrast Hindus from Muslims who are called Yavanas (foreigners) or Mlecchas (barbaria

**Add Control Tokens**

In [53]:
final_pps = prepend_control_tokens(paraphrases)

**Fine-tune BART for Text Simlplification**

In [55]:
tokenizer = get_tokenizer()

In [56]:
dataset = get_dataset(final_pps, tokenizer)

In [57]:
model = get_model(len(tokenizer))

In [None]:
data_collator = get_data_collator(tokenizer, model)
args = get_training_arguments(epochs=1, batch_size=8)
trainer = get_trainer(model, tokenizer, dataset, data_collator, args)

In [65]:
trainer.train()

***** Running training *****
  Num examples = 72
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 9


Step,Training Loss




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=9, training_loss=0.9259072409735786, metrics={'train_runtime': 278.6859, 'train_samples_per_second': 0.258, 'train_steps_per_second': 0.032, 'total_flos': 20367397158912.0, 'train_loss': 0.9259072409735786, 'epoch': 1.0})

---