# How to paraphrase text using transformers in Python
In this notebook we will learn how to **paraphrase text using a transformer**. For this, we will use the **Google Pegasus model**.
Before getting started, let's be clear about the following terms:
1. **Paraphrase:** to restate a sentence in your own words while preserving its original meaning or message.
Eg: She could see that the cats were watching her very suspiciously.
Paraphrased: She observed that the felines were watching her with suspicion.
2. **Google Pegasus model:** PEGASUS is a transformer-based model for abstractive summarization. It uses a self-supervised pre-training objective called gap-sentences generation (GSG), in which important sentences are removed from a document and the model learns to generate them from the remaining text. This objective is designed to perform well on summarization-related downstream tasks. The checkpoint used in this notebook, `tuner007/pegasus_paraphrase`, is a Pegasus checkpoint intended for paraphrasing.
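
To make the gap-sentences idea concrete, here is a toy sketch in plain Python. It is not the PEGASUS implementation: the paper scores sentence importance with ROUGE, whereas this sketch uses a simple word-overlap score as a stand-in, and the helper names `overlap_score` and `make_gsg_example` are hypothetical. The `<mask_1>` placeholder is assumed here to mirror the sentence-mask token of the Pegasus tokenizer.

```python
def overlap_score(sentence, others):
    """Fraction of the sentence's words that also appear in the other sentences."""
    words = set(sentence.lower().split())
    rest = set(" ".join(others).lower().split())
    return len(words & rest) / len(words) if words else 0.0


def make_gsg_example(sentences, num_masked=1, mask_token="<mask_1>"):
    """Mask the highest-scoring sentences and return (source, target) strings."""
    scores = [
        overlap_score(s, sentences[:i] + sentences[i + 1:])
        for i, s in enumerate(sentences)
    ]
    # Rank sentence indices by score, breaking ties by position in the document.
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    masked = set(ranked[:num_masked])
    # The source keeps unmasked sentences and replaces masked ones with the mask token.
    source = " ".join(mask_token if i in masked else s for i, s in enumerate(sentences))
    # The target is the text the model would be trained to generate.
    target = " ".join(sentences[i] for i in sorted(masked))
    return source, target


doc = [
    "Pegasus is a transformer model.",
    "The model is trained to generate missing sentences.",
    "It rained yesterday.",
]
source, target = make_gsg_example(doc, num_masked=1)
print("source:", source)
print("target:", target)
```

During pre-training the model sees `source` as input and is asked to produce `target`, which resembles writing a summary of the missing part.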

**Transformer** is a highly influential deep learning architecture based on self-attention, introduced by Google researchers in 2017. It largely replaced recurrent models in NLP and enabled parallelized training of large language models on massive datasets. Transformer-based pre-trained models like BERT and GPT have achieved state-of-the-art performance across many NLP tasks and in other domains.
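
As a rough illustration of self-attention, the sketch below computes scaled dot-product attention in plain Python. It is a simplification under stated assumptions: a single head, no learned query/key/value projections, and made-up two-dimensional token vectors. Real transformer layers add those projections, multiple heads, and more.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention: each output is a weighted average of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Mix the value vectors according to the attention weights.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Three hypothetical token vectors; in self-attention they act as queries, keys and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)
print(attended)
```

Because every token's output depends on all tokens at once, the rows can be computed in parallel, which is what makes training on large datasets practical.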

In [5]:
# Install libraries
!pip install sentence-splitter  # splits text into sentences (paragraph to sentences)
!pip install transformers  # transformer models, created and maintained by the Hugging Face team
!pip install sentencepiece  # language-agnostic subword tokenizer and detokenizer, required by PegasusTokenizer




In [7]:
import torch  # PyTorch, the deep learning library that backs the model
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
# PegasusForConditionalGeneration generates output text (here, a paraphrase) conditioned on the input text.
# PegasusTokenizer preprocesses text by converting it into a sequence of token IDs.

In [8]:
model = PegasusForConditionalGeneration.from_pretrained('tuner007/pegasus_paraphrase')
tokenizer = PegasusTokenizer.from_pretrained('tuner007/pegasus_paraphrase')


Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at tuner007/pegasus_paraphrase and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



**Tokenization**

In [22]:
text = "God is all merciful and powerful"
# [text] is a batch of one; padding=True pads to the longest sequence in the batch,
# truncation=True cuts inputs longer than max_length=60 tokens, return_tensors='pt' returns PyTorch tensors.
batch = tokenizer([text], padding=True, truncation=True, max_length=60, return_tensors='pt')
# **batch unpacks the tokenized inputs; max_length=60 caps each generated paraphrase at 60 tokens;
# num_beams=5 runs beam search with 5 beams and num_return_sequences=5 returns the 5 best of them.
# temperature only takes effect when sampling is enabled (do_sample=True); plain beam search ignores it.
output = model.generate(**batch, max_length=60, num_beams=5, num_return_sequences=5, temperature=1.5)
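
The effect of `temperature` is easier to see on a small example. The sketch below is a plain-Python illustration, not the library's internal code: it divides some made-up logits by the temperature before applying softmax. The function name `softmax_with_temperature` and the logit values are assumptions chosen for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=1.5)
print("temperature 0.5:", [round(p, 3) for p in sharp])
print("temperature 1.5:", [round(p, 3) for p in flat])
```

A higher temperature flattens the distribution, so less likely tokens are drawn more often when the model samples. With pure beam search no sampling happens, which is why the setting has no visible effect unless `do_sample=True` is also passed.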

In [23]:
results = tokenizer.batch_decode(output,skip_special_tokens=True)
results


['God is powerful.',
 'God is powerful and compassionate.',
 'God is powerful and mercy.',
 'God is very powerful.',
 'God is very powerful and compassionate.']
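
The five results above are the five highest-scoring beams. To show how `num_beams` and `num_return_sequences` interact, here is a toy beam search over a hand-written next-token table. Everything in it is hypothetical (the `TOY_MODEL` table, its probabilities, and the `beam_search` helper), and it omits details of the real decoder such as length penalties.

```python
import math

# Hypothetical toy "language model": next-token probabilities given the last token.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"mind": 0.7, "cat": 0.3},
    "a": {"mind": 0.5, "cat": 0.5},
    "mind": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
}


def beam_search(num_beams, num_return_sequences, max_length=5):
    """Keep the num_beams best partial sequences at every step."""
    if num_return_sequences > num_beams:
        raise ValueError("num_return_sequences must be <= num_beams")
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens so far)
    finished = []
    for _ in range(max_length):
        # Extend every live beam by every possible next token.
        candidates = []
        for score, tokens in beams:
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        # Keep only the best num_beams candidates; set aside the ones that ended.
        beams = []
        for cand in candidates[:num_beams]:
            (finished if cand[1][-1] == "</s>" else beams).append(cand)
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    # Strip the start and end markers before returning text.
    return [" ".join(tokens[1:-1]) for _, tokens in finished[:num_return_sequences]]


print(beam_search(num_beams=4, num_return_sequences=3))
print(beam_search(num_beams=1, num_return_sequences=1))  # greedy decoding
```

A wider beam explores more alternatives, and `num_return_sequences` simply chooses how many of the surviving beams are handed back, so it cannot exceed `num_beams`.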

**Save the model and tokenizer**




In [24]:
model.save_pretrained("/content/drive/MyDrive/model")
tokenizer.save_pretrained("/content/drive/MyDrive/tokenizer")

Non-default generation parameters: {'max_length': 60, 'num_beams': 8, 'length_penalty': 0.8, 'forced_eos_token_id': 1}


('/content/drive/MyDrive/tokenizer/tokenizer_config.json',
 '/content/drive/MyDrive/tokenizer/special_tokens_map.json',
 '/content/drive/MyDrive/tokenizer/spiece.model',
 '/content/drive/MyDrive/tokenizer/added_tokens.json')
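
Once saved, the model and tokenizer can be loaded back from those directories instead of being downloaded again. The helper below is a minimal sketch: the name `load_saved_paraphraser` is hypothetical, and it assumes the two directories were written by `save_pretrained` as in the cell above.

```python
from pathlib import Path


def load_saved_paraphraser(model_dir, tokenizer_dir):
    """Reload a model and tokenizer previously written with save_pretrained()."""
    model_dir, tokenizer_dir = Path(model_dir), Path(tokenizer_dir)
    for directory in (model_dir, tokenizer_dir):
        # Fail early with a clear message if a directory is missing.
        if not directory.is_dir():
            raise FileNotFoundError(f"Saved directory not found: {directory}")
    # Imported here so the path check above works even before transformers is installed.
    from transformers import PegasusForConditionalGeneration, PegasusTokenizer
    model = PegasusForConditionalGeneration.from_pretrained(str(model_dir))
    tokenizer = PegasusTokenizer.from_pretrained(str(tokenizer_dir))
    return model, tokenizer


# Example call, using the paths from this notebook (requires the saved files to exist):
# model, tokenizer = load_saved_paraphraser(
#     "/content/drive/MyDrive/model", "/content/drive/MyDrive/tokenizer"
# )
```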

**Predictive System (Generate Paraphrase)**

In [37]:
def get_response(input_text, num_return_sequences, num_beams):
    """Return num_return_sequences paraphrases of input_text, found with a beam search over num_beams beams."""
    # Tokenize: truncate to 60 tokens, pad to the longest sequence in the batch, return PyTorch tensors.
    batch = tokenizer([input_text], truncation=True, padding='longest', max_length=60, return_tensors="pt")
    # Generate up to 60 tokens per candidate. temperature has no effect here because do_sample is not set.
    translated = model.generate(**batch, max_length=60, num_beams=num_beams, num_return_sequences=num_return_sequences, temperature=1.5)
    # Decode token IDs back to text, dropping special tokens such as '<pad>' and '</s>'.
    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
    return tgt_text
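
Since `get_response` caps its input at 60 tokens, a longer paragraph is best paraphrased one sentence at a time, which is what the `sentence-splitter` package installed earlier is for. The sketch below shows one possible way to wire this up. The helper names (`naive_split`, `paraphrase_paragraph`, `fake_paraphrase`) are hypothetical, and a stand-in paraphraser plus a simple regex splitter are used so the example runs without the model.

```python
import re


def naive_split(paragraph):
    """Fallback splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def paraphrase_paragraph(paragraph, paraphrase_fn, split_fn=naive_split):
    """Paraphrase each sentence separately and join the top candidate of each."""
    rewritten = []
    for sentence in split_fn(paragraph):
        candidates = paraphrase_fn(sentence)
        # Keep the original sentence if the paraphraser returns nothing.
        rewritten.append(candidates[0] if candidates else sentence)
    return " ".join(rewritten)


def fake_paraphrase(sentence):
    """Stand-in for the model: upper-cases the sentence so the flow is visible."""
    return [sentence.upper()]


print(paraphrase_paragraph("Practise helps. The mind is restless.", fake_paraphrase))

# With the real model and splitter, the call could look like this:
# from sentence_splitter import SentenceSplitter
# splitter = SentenceSplitter(language='en')
# paraphrase_paragraph(paragraph, lambda s: get_response(s, 1, 10), splitter.split)
```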

In [39]:
num_beams = 10  # beam width for beam search
num_return_sequences = 10  # number of paraphrases to return (must not exceed num_beams)
context = "Practise can control the restless mind"

get_response(context, num_return_sequences, num_beams)


['It is possible to control the restless mind.',
 "It's possible to control the restless mind.",
 'It is possible to control the restless mind by practicing.',
 'The restless mind can be controlled by practise.',
 "It's possible to control the restless mind by practicing.",
 'The restless mind can be controlled with practise.',
 'It is possible to control the restless mind with practise.',
 "It's possible to control the restless mind with practise.",
 'It is possible to control the restless mind with practice.',
 "It's possible to control the restless mind with practice."]