# Simple Paraphrase Tool 

This is a simple tool that paraphrases text using the [__*ramsrigouthamg/t5-large-paraphraser-diverse-high-quality*__](https://huggingface.co/ramsrigouthamg/t5-large-paraphraser-diverse-high-quality) pre-trained model, hosted on [Hugging Face](https://huggingface.co/).



### Installing the necessary packages 

We need `transformers` (to download and run the model from Hugging Face) and `sentencepiece`, which the T5 tokenizer depends on.

*Protobuf* does not need to be installed separately if you're using Google Colab. I ran this on GitHub Codespaces, where protobuf was not preinstalled, so it is installed here as well.

In [None]:
%pip install sentencepiece 
%pip install transformers
%pip install protobuf 
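
Before running the install cell, it can be useful to see which of these packages the environment already provides. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the module names passed to it (`sentencepiece`, `transformers`, `google.protobuf`) are the import names assumed for the three packages above.

```python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. "google") is itself missing
            found = False
        if not found:
            missing.append(name)
    return missing


# Anything printed here still needs a %pip install
print(missing_packages(["sentencepiece", "transformers", "google.protobuf"]))
```

An empty list means all three imports should resolve and the install cell can be skipped.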

In [2]:
# Import the two necessary libraries
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

### The Paraphrase Class 

Put simply, we're using a pre-trained model to rewrite phrases.

`self.device` is set to "cuda" to use the GPU; if torch cannot find a GPU, it falls back to the CPU.

**The rewrite function** returns a `list` of paraphrased sentences, from which the most appropriate one (in grammar and meaning) can be chosen.
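
The list that `rewrite` returns only contains candidates that differ from the input text and from each other. As a rough sketch of that filtering step in isolation (the function name `filter_candidates` is hypothetical, and unlike the class in this notebook it also treats duplicates case-insensitively):

```python
def filter_candidates(original, candidates):
    """Drop candidates equal to the input text or to an earlier candidate."""
    original_key = original.strip().lower()
    seen = set()
    kept = []
    for candidate in candidates:
        key = candidate.strip().lower()
        if key == original_key or key in seen:
            continue  # same as the input, or already collected
        seen.add(key)
        kept.append(candidate.strip())
    return kept


print(filter_candidates(
    "The fox jumps.",
    ["the fox jumps.", "The fox leaps.", "The fox leaps.", "A fox is jumping."],
))
```

Order is preserved, so the highest-ranked surviving candidate stays first.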

In [3]:
class Paraphrase: 
    def __init__(self): 
       self.model = AutoModelForSeq2SeqLM.from_pretrained("ramsrigouthamg/t5-large-paraphraser-diverse-high-quality") # This model is trained on a large dataset of paraphrases and is able to generate high quality paraphrases.
       self.tokenizer = AutoTokenizer.from_pretrained("ramsrigouthamg/t5-large-paraphraser-diverse-high-quality")
       self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # use gpu if available else use cpu
       self.model = self.model.to(self.device) # load model into gpu if available else load into cpu
    
    # Simple rewrite function, takes in a string and returns a list of paraphrased sentences
    def rewrite(self, text) -> list :
       encoding = self.tokenizer.encode_plus(text, max_length=128, truncation=True, padding=True, return_tensors="pt") # encode input text into token ids and an attention mask; truncation=True makes max_length take effect
       input_ids,attention_mask  = encoding["input_ids"].to(self.device), encoding["attention_mask"].to(self.device) # load tensors into gpu if available else load into cpu 
       self.model.eval() # set model to evaluation mode
       # generate paraphrases using beam search with beam size 5, beam groups 5, and diversity penalty 0.70
       diverse_beam_outputs = self.model.generate(
       input_ids=input_ids,attention_mask=attention_mask, 
       max_length=128, # maximum length of generated paraphrase (can be changed)
       early_stopping=True, # stop generation when all beam hypotheses reach end of sentence token (EOS)
       num_beams=5, # number of beams: beam search keeps the num_beams highest-scoring partial sequences at each decoding step instead of only the single best one
       num_beam_groups = 5,
       num_return_sequences=5,
       diversity_penalty = 0.70) # higher penalty means more diverse paraphrases
       phc = [] 
       for beam_output in diverse_beam_outputs: # iterate through each paraphrased sentence
             sent = self.tokenizer.decode(beam_output, skip_special_tokens=True,clean_up_tokenization_spaces=True) # decode tokenized ids into paraphrased sentence. Skips special tokens and cleans up tokenization spaces
             if sent.lower() != text.lower() and sent not in phc:
                phc.append(sent)
       return phc
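
The `num_beams` argument above controls beam search. To make the idea concrete, here is a minimal sketch of plain beam search on a toy next-token table. It is not the grouped, diversity-penalised variant that `generate` runs in the class, and `TRANSITIONS`, `beam_search`, and the probabilities are all made up for illustration.

```python
import math

# Hypothetical toy "model": next-token probabilities depend only on the last token
TRANSITIONS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"fox": 0.5, "dog": 0.5},
    "a": {"fox": 0.9, "dog": 0.1},
    "fox": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(num_beams, max_length=4):
    """Return up to num_beams (log_probability, tokens) pairs, best first."""
    beams = [(0.0, ["<s>"])]
    for _ in range(max_length):
        candidates = []
        for score, tokens in beams:
            last = tokens[-1]
            if last == "</s>":
                candidates.append((score, tokens))  # finished, carry over
                continue
            for token, prob in TRANSITIONS[last].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the num_beams highest-scoring sequences
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:num_beams]
    return beams


for width in (1, 2):
    score, tokens = beam_search(width)[0]
    print(width, " ".join(tokens), round(math.exp(score), 2))
```

With a width of 1 (greedy decoding) the search commits to "the" and ends at "the fox" with probability 0.30, while a width of 2 keeps "a" alive long enough to find "a fox" at 0.36.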
       



Now, let's run the model!

- Your output might be something like: 

```json
["paraphrasedoutput: I don't even know if I'll make it to the party.", ....]
```
- As this shows, we need to remove the *paraphrasedoutput:* prefix from each string in the list.

- So we create a function called `extract_sentences` that strips the "paraphrasedoutput:" prefix from each item (`str`) in the list.

- Next, we need to pick the most appropriate paraphrase from the list of sentences. 
  To do that, we use a library called `language-tool-python`.

- `language-tool-python` checks the grammar of each sentence, which lets us choose the most apt of the lot.
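
The selection step in the list above can be sketched without a real grammar checker. In this minimal example, `pick_cleanest` and `toy_issue_count` are hypothetical names, and `toy_issue_count` is only a stand-in for a grammar checker: it counts doubled spaces and a missing final punctuation mark.

```python
def pick_cleanest(sentences, count_issues):
    """Return the sentence with the fewest reported issues (first one wins ties)."""
    if not sentences:
        return ""
    return min(sentences, key=count_issues)


def toy_issue_count(sentence):
    """Stand-in checker: one issue per doubled space, one for no final punctuation."""
    issues = sentence.count("  ")
    if not sentence.endswith((".", "!", "?")):
        issues += 1
    return issues


candidates = [
    "A fast brown fox  leaps over the lazy dog",
    "A fast brown fox leaps over the lazy dog.",
]
print(pick_cleanest(candidates, toy_issue_count))
```

Passing the checker in as a function keeps the selection logic independent of whichever grammar tool supplies the issue counts.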

In [None]:
text = "The quick brown fox jumps over the lazy dog."

# Run model

para = Paraphrase() # Load model
stxs = para.rewrite(text) # Rewrite text

In [None]:
# install language tool for grammar check
%pip install language-tool-python

In [6]:
# Strip the "paraphrasedoutput:" prefix from each sentence in the model output

def extract_sentences(sentences) -> list :
    sens = [] 
    for sentence in sentences: 
        cleaned_sentence = sentence.replace("paraphrasedoutput:", "").strip() # remove the prefix 
        sens.append(cleaned_sentence)
    return sens
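
Note that `str.replace` removes the marker wherever it occurs in a sentence. If only a leading marker should be dropped, a variant along these lines may be safer. This is a sketch, not part of the original notebook; `PREFIX` and `strip_prefix` are hypothetical names.

```python
PREFIX = "paraphrasedoutput:"


def strip_prefix(sentence, prefix=PREFIX):
    """Remove the prefix only when it appears at the start of the sentence."""
    sentence = sentence.strip()
    if sentence.lower().startswith(prefix.lower()):
        sentence = sentence[len(prefix):]
    return sentence.strip()


print(strip_prefix("paraphrasedoutput: I don't even know if I'll make it to the party."))
print(strip_prefix("He wrote paraphrasedoutput: on the board."))
```

The second call leaves the sentence untouched because the marker is not at the start.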


In [7]:
from language_tool_python import LanguageTool

def appropriation(sentences):
    tool = LanguageTool('en-US')  # Grammar checker

    best_sentence = ""
    best_issue_count = float('inf')  # fewer reported issues is better

    for sentence in sentences:
        matches = tool.check(sentence)  # one match per detected issue
        issue_count = len(matches)

        # keep the sentence with the fewest grammar issues (first one wins ties)
        if issue_count < best_issue_count:
            best_issue_count = issue_count
            best_sentence = sentence

    return best_sentence


In [None]:
btx = extract_sentences(stxs)
print(appropriation(btx))