<a href="https://colab.research.google.com/github/s-c-soma/AdvanceDeeplearning-CMPE-297/blob/master/Assignment_5/ExtraCredit_Assignment_5d_Paraphrasing_for_Data_Augmentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Augmentation by Paraphrasing

## Implementation Details and Discussion

This notebook implements data augmentation by paraphrasing.
The model takes a sentence as input and generates multiple paraphrased versions of it.

For example:

input message: I want to book an appointment for tomorrow.

### Generated Paraphrases
---------------------

- I want to book an appointment for tomorrow.
- i need to book an appointment for another meeting.

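Both outputs are grammatical. Note, however, that the first repeats the input verbatim and the second changes part of the meaning ("for tomorrow" becomes "for another meeting"), so generated paraphrases should be reviewed before they are added to the training data.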
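
Since a generated candidate can be identical to the input, as in the example above, it may help to filter such candidates out before review. The following is a minimal sketch, not part of the original notebook: the helper names `normalize` and `drop_trivial_paraphrases` are hypothetical, and the comparison is a simple case- and punctuation-insensitive string match rather than a semantic check.

```python
import re


def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace for comparison."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def drop_trivial_paraphrases(source, candidates):
    """Keep candidates that differ from the source and from each other
    after normalization (hypothetical helper, not from the notebook)."""
    seen = {normalize(source)}
    kept = []
    for candidate in candidates:
        key = normalize(candidate)
        if key not in seen:
            seen.add(key)
            kept.append(candidate)
    return kept


source = "I want to book an appointment for tomorrow."
candidates = [
    "I want to book an appointment for tomorrow.",
    "i need to book an appointment for another meeting.",
]
filtered = drop_trivial_paraphrases(source, candidates)
print(filtered)
```

A filter like this only removes exact repeats; judging whether a paraphrase preserves the intent still requires a manual check.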

In [None]:
!pip install torch torchvision transformers==2.10.0 rasa==1.10.0 input_reader

## Imports

In [None]:
import ipywidgets as widgets
import os
import requests
from IPython.display import display
from ipywidgets import interact

from rasa.nlu.training_data import TrainingData, Message

## Load Data


In [None]:
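def download_file_from_google_drive(file_id, destination):
    """Download a Google Drive file by its ID and save it to `destination`."""
    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params={'id': file_id}, stream=True)
    # Google Drive may answer with a warning page and set a
    # 'download_warning' cookie that carries a confirmation token.
    token = get_confirm_token(response)

    if token:
        # Repeat the request with the token to get the actual file content.
        params = {'id': file_id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)

    save_response_content(response, destination)

def get_confirm_token(response):
    """Return the confirmation token from the response cookies, or None."""
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value

    return None

def save_response_content(response, destination):
    """Stream the response body to `destination` in fixed-size chunks."""
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk:  # filter out keep-alive chunks
                f.write(chunk)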

model_class_file_id = '1N1kn2b7i2ND7eNefzyJM-k13IM8tqZvr'
checkpoint_file_id = '1G0nwXlvzGsb8Ar-OAnYBQKFvY97WMzBy'
model_class_destination = 'model.py'
checkpoint_destination = 'model.zip'
checkpoint_unzipped_destination = 'package_models'

if not os.path.exists(checkpoint_unzipped_destination):
  download_file_from_google_drive(checkpoint_file_id, checkpoint_destination)
  !unzip {checkpoint_destination}

if not os.path.exists(model_class_destination):
  download_file_from_google_drive(model_class_file_id, model_class_destination)

In [None]:
from model import ParaphraseModel
model_path = 'package_models/lm_finetune_8/checkpoint-56000/'

complete_td = TrainingData()
model = ParaphraseModel(model_path)

### Evaluate

In [None]:
input_phrase = input("Enter a message for which you would like to generate paraphrases: ")

In [None]:
number_samples = int(input("Number of paraphrases to generate: "))
stop_words = input("Stop words to constrain with (multiple values semicolon-separated): ")
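
The prompt above asks for several stop words separated by semicolons. This notebook passes the raw string straight to `get_paraphrases`, and whether that method expects a string or a list is not shown here. If a list were needed, the input could be split along these lines; `parse_stop_words` is a hypothetical helper, not part of `ParaphraseModel`.

```python
def parse_stop_words(raw):
    """Split a semicolon-separated string into trimmed, non-empty stop words."""
    return [word.strip() for word in raw.split(";") if word.strip()]


# Extra spaces and empty entries are dropped.
parsed = parse_stop_words("tomorrow; appointment ;;book")
print(parsed)  # ['tomorrow', 'appointment', 'book']
```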

In [None]:
paraphrases = model.get_paraphrases(input_phrase, number_samples, stop_words)

In [None]:
print("Steps:\n1. Read all proposed paraphrases below.\n2. Select valid paraphrases that you would like\
 to add to your NLU training data. Use Ctrl/Cmd + Click to select multiple.\n\
3. Enter the name of the intent under which these messages should be categorized\n\
4. Click 'Add to training data'\n\
5. Copy the training data displayed in Rasa Markdown format to your existing training data file.\n\
6. You can go back to 3 cells above this to enter new messages for which you want to generate paraphrases.")

paraphrase_widget = widgets.SelectMultiple(
    options=paraphrases,
    value=[],
    rows=number_samples,
    description='Paraphrases',
    disabled=False,
    layout= widgets.Layout(width='100%')
)
display(paraphrase_widget)

intent = widgets.Text(description="Intent")
display(intent)

button = widgets.Button(description="Add to Training Data")
output = widgets.Output()

display(button, output)

def on_button_clicked(b):
    
    global complete_td
    
    with output:
        intent_value = intent.value
        selected_paraphrases = paraphrase_widget.value
        
        if not selected_paraphrases:
            print("Error: You haven't selected any paraphrases")
            return
        if not intent_value:
            print("Error: Please enter the intent name under which these messages should be categorized.")
            return
        
        all_messages = [Message.build(text=input_phrase, intent=intent_value)]
        for paraphrase in selected_paraphrases:
            all_messages.append(Message.build(text=paraphrase, intent=intent_value))
            
        complete_td = complete_td.merge(TrainingData(training_examples=all_messages))
        
        print(complete_td.nlu_as_markdown())

button.on_click(on_button_clicked)
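
For reference, the Rasa Markdown format printed by the button handler groups example messages under a `## intent:<name>` heading, one `- ` bullet per example. The sketch below reproduces that layout in plain Python so the shape of the output is visible without running the widget. `to_rasa_markdown` is a hypothetical helper, and the exact ordering and whitespace produced by `nlu_as_markdown()` may differ.

```python
from collections import OrderedDict


def to_rasa_markdown(examples):
    """Group (text, intent) pairs by intent and render them in the
    Rasa 1.x Markdown layout (illustrative helper, not Rasa's own code)."""
    grouped = OrderedDict()
    for text, intent in examples:
        grouped.setdefault(intent, []).append(text)

    sections = []
    for intent, texts in grouped.items():
        lines = ["## intent:" + intent] + ["- " + t for t in texts]
        sections.append("\n".join(lines))
    return "\n\n".join(sections) + "\n"


examples = [
    ("I want to book an appointment for tomorrow.", "book_appointment"),
    ("i need to book an appointment for another meeting.", "book_appointment"),
]
markdown = to_rasa_markdown(examples)
print(markdown)
```

The intent name `book_appointment` is only an example; in the notebook it comes from the "Intent" text field.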