<a href="https://colab.research.google.com/gist/justheuristic/324eeb95796c6cf90bab7355af00d32b/practice_style_transfer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Week 10 - text style transfer

Hello, citizen class A.412C!

Based on your browser search history, we conclude that you have above-average skills in natural language processing. In our benevolence, we give you a chance to contribute your skills to upholding the happiest society in the universe. Are you up to the task?

As you know, our most recent breakthrough was replacing 97% of restaurant workers with BFGHQBERT+++ autonomous food dispensers.

Yet some radical elements failed to recognize the greater good that we brought them. They mistakenly voice their ignorant opinions about our new INGSOC-approved restaurants, bringing dangerous doubt to the minds of our loyal citizens.

Surely you cannot tolerate such infidelity! Our loyal citizens demand that you rectify their mistake. _You must build a model that will automatically improve their ignorant thoughts and replace them with the thoughts they should actually have._

Attached below are the INGSOC-approved datasets for ignorant and correct thoughts. The scientific terminology for wrong opinions and correct opinions is "negative" and "positive", respectively.

Respond within 7 days or you will lose 3.7629 citizenship points.

![img](https://ih1.redbubble.net/image.1254830934.9884/poster,504x498,f8f8f8-pad,400x240,f8f8f8.jpg)

In [None]:
!pip install -q transformers
!wget -q https://github.com/shentianxiao/language-style-transfer/raw/master/data/yelp/sentiment.train.0 -O train_negative
!wget -q https://github.com/shentianxiao/language-style-transfer/raw/master/data/yelp/sentiment.train.1 -O train_positive
!wget -q https://github.com/shentianxiao/language-style-transfer/raw/master/data/yelp/sentiment.dev.0 -O dev_negative
!wget -q https://github.com/shentianxiao/language-style-transfer/raw/master/data/yelp/sentiment.dev.1 -O dev_positive

In [None]:
!head -n 5 ./dev_positive
!echo
!head -n 5 ./dev_negative

staff behind the deli counter were super nice and efficient !
love this place !
the staff are always very nice and helpful .
the new yorker was amazing .
very ny style italian deli .

ok never going back to this place again .
easter day nothing open , heard about this place figured it would ok .
the host that walked us to the table and left without a word .
it just gets worse .
the food tasted awful .


In [None]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
device = 'cuda' if torch.cuda.is_available() else 'cpu'

if device == 'cpu':
    print("Fine-tuning BERT without an accelerator is not party-approved.")

### Part 1: Masked language model

Attached below you can find the INGSOC-compliant training code that fine-tunes a BERT model for Masked Language Modeling.

You shall use this model to generate positive replacements for negative tokens.

In [None]:
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_mlm_positive = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict=True).to(device).train(True)

In [None]:
from transformers import LineByLineTextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

print("Preparing the training data...")
dataset = LineByLineTextDataset(
    file_path="./train_positive", tokenizer=tokenizer, block_size=128)

print("Dataset ready!")

trainer = Trainer(
    model=bert_mlm_positive, train_dataset=dataset, 
    data_collator=DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15),
    args=TrainingArguments(
        output_dir="./bert_mlm_positive", overwrite_output_dir=True,
        num_train_epochs=1, per_device_train_batch_size=32,
        save_steps=10_000, save_total_limit=2),
)

trainer.train()

In [None]:
# <Build and train a MLM for incorrect opinions>

bert_mlm_negative = <...>

<A whole lot of your code>

### Part 2: Replace tokens

You can now use the two masked language models to align user opinions. You can do so with the following steps:

1. Find tokens where the ratio $(P_{positive}(x) + \epsilon) / (P_{negative}(x) + \epsilon)$ is the smallest
2. Replace those tokens with one of the $k$ most likely tokens according to $P_{positive}(x)$.
3. Rinse, repeat

You can find the full procedure at https://arxiv.org/abs/2010.01054
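
To make step 1 concrete, here is a minimal sketch on made-up numbers. The probabilities below and the helper name `least_positive_positions` are illustrative assumptions, not outputs of the fine-tuned models. In a real solution, each probability would presumably come from masking one position at a time and reading the probability that each MLM assigns to the original token.

```python
def least_positive_positions(p_positive, p_negative, num_tokens, epsilon=1e-3):
    """Return indices of the num_tokens positions with the smallest
    (P_positive + eps) / (P_negative + eps) ratio, smallest ratio first."""
    ratios = [(pp + epsilon) / (pn + epsilon)
              for pp, pn in zip(p_positive, p_negative)]
    # Sort positions by ratio in ascending order and keep the first few.
    order = sorted(range(len(ratios)), key=lambda i: ratios[i])
    return order[:num_tokens]

# Hypothetical per-token probabilities for "the staff is horrible !"
tokens = ["the", "staff", "is", "horrible", "!"]
p_pos = [0.60, 0.30, 0.70, 0.0005, 0.80]
p_neg = [0.55, 0.25, 0.65, 0.2000, 0.40]

worst = least_positive_positions(p_pos, p_neg, num_tokens=1)
print([tokens[i] for i in worst])  # ['horrible']
```

The `epsilon` term keeps the ratio stable when both models assign a token a near-zero probability, so rare tokens are not selected purely because of numerical noise.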

In [None]:
def get_replacements(sentence: str, num_tokens, k_best, epsilon=1e-3):
  """
  - split the sentence into tokens using the INGSOC-approved BERT tokenizer
  - find :num_tokens: tokens with the lowest ratio (see above)
  - replace them with :k_best: words according to bert_mlm_positive
  :return: a list of all possible strings (up to k_best * num_tokens)
  """
  <YOUR CODE HERE>

  return <...>

In [None]:
dev_data = list(open('./dev_negative'))

In [None]:
dev_data[500:505]

In [None]:
get_replacements("great wings and decent drinks but the wait staff is horrible !",
                 num_tokens=1, k_best=2)
# >>> ["great wings and decent drinks but the wait staff is great !", "great wings and decent drinks but the wait staff is awesome !"]

__Final task__ - build a procedure that iteratively applies replacements. Demonstrate the effectiveness of your approach with at least 10 examples to satisfy INGSOC.


In [None]:
<YOUR CODE HERE>

### Part 3: Classifier & beam search (5 pts)

Sometimes the roots of dissent are too deep to rip out with single word replacements. If you truly are a class A.412C citizen, surely you understand what it means.

In order to better serve your fellow citizens, you must improve your solution. Train a classifier model that will separate the negative (ignorant) opinions from positive ones.

With this classifier you can now generate multiple best hypotheses and search for the ones that have the highest $P_{classifier}(\text{positive} | x)$.
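
The search loop itself can be sketched independently of the models. The example below is a toy illustration under stated assumptions: `toy_expand` is a hypothetical lookup table standing in for MLM-based candidate generation, and `toy_score` is a hypothetical word-counting function standing in for $P_{classifier}(\text{positive} | x)$. Only the `beam_search` skeleton is meant to carry over.

```python
def beam_search(start, expand, score, beam_size=2, num_steps=3):
    """Keep the beam_size best-scoring hypotheses after each round of expansion."""
    beam = [start]
    for _ in range(num_steps):
        # Current hypotheses stay in the pool, so the best score never decreases.
        candidates = set(beam)
        for hypothesis in beam:
            candidates.update(expand(hypothesis))
        # Highest score first; ties are broken alphabetically for determinism.
        beam = sorted(candidates, key=lambda s: (-score(s), s))[:beam_size]
    return beam

# Toy stand-ins (assumptions): a lookup table instead of the MLM,
# and a word-counting scorer instead of the BERT classifier.
SWAPS = {"horrible": ["great", "awesome"], "slow": ["fast"]}
POSITIVE = {"great", "awesome", "fast"}

def toy_expand(sentence):
    """Generate every sentence reachable by one single-word swap."""
    words = sentence.split()
    results = []
    for i, word in enumerate(words):
        for replacement in SWAPS.get(word, []):
            results.append(" ".join(words[:i] + [replacement] + words[i + 1:]))
    return results

def toy_score(sentence):
    """Fraction of words that appear in the toy positive vocabulary."""
    words = sentence.split()
    return sum(w in POSITIVE for w in words) / len(words)

start = "the staff is horrible and slow"
best = beam_search(start, toy_expand, toy_score)
print(best[0])  # the staff is awesome and fast
```

Keeping more than one hypothesis per step is what lets the search fix sentences that need several replacements, since an intermediate sentence may only look clearly positive after a second or third swap.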

In [None]:
from transformers import BertForSequenceClassification, DataCollatorWithPadding

<YOUR CODE HERE - build and train the classifier model>

In [None]:
# your final task is to build a beam search-like procedure that iteratively
# generates candidates using MLM and selects M best with classifier

# as before, your fellow citizens request that you show your loyalty by 
# writing a short report on how your method works and demonstrating
# the effectiveness of your system with at least 10 examples

# Note: as a class >=A410 citizen, you are entitled to creativity level 2.1:
# you may modify the search objective by using language models, different search procedures
# or implement a completely different style transfer method.

<OBEY HERE>

### Part 4: Deployment! (5 pts)

By now you have built a model that can change the style of your reviews. But what of the others? There are circa 8 billion of us and only one of you, so it is only right that you share your invention with every one of us.

Your final task is to __build a web interface around your model that others can use in their browser__.

There are many ways you can do so, one of them being TensorFlow.js, which we learned about a few weeks ago. You can solve this task any way you want _provided that an INGSOC-certified teaching assistant will be able to view it in their browser_.
Below we cover one (arguably the simplest) way, using streamlit and pure python.

[Streamlit](https://streamlit.io/) is a simple python-based framework for developing interactive ML apps that run python on the backend. You define your frontend using a combination of markdown and widgets such as text inputs, charts, checkboxes, etc.

You can install streamlit with `pip install streamlit`, but __please switch from colab to your local computer__, otherwise you won't be able to view the results of your work!

Let's walk through a simple streamlit app:

```python
import streamlit as st
st.set_page_config(page_title="A + B calculator pro max", layout="centered")
st.markdown("## Hello, world!")
a = st.number_input("A =", value=0)
b = st.number_input("B =", value=0)
a_plus_b = a + b
st.markdown(f"$A + B = {a_plus_b}$")
```

You can start this app on your computer with `streamlit run app.py`, where `app.py` is the name of your python file.

![img](https://i.imgur.com/GUTjZQC.png)


You can host this app in three ways: [streamlit cloud](https://streamlit.io/cloud), [huggingface spaces](https://huggingface.co/new-space) or on your own server. All of them can be done for free, but we recommend the middle one as it is the simplest.

You can find the full tutorial on hosting with huggingface spaces here: https://huggingface.co/docs/hub/spaces