<a href="https://colab.research.google.com/github/tomradch/MSCIDS_Computational_Language_Technologies/blob/main/6_0_text_style_transfer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Style Transfer with HuggingFace - Let's become polite in our emails!

In this notebook, we will address the problem of text style transfer: given an input document, how can we rewrite it in another predefined style?

We will use the politeness transfer data from the paper https://arxiv.org/pdf/2004.14257.pdf.

Let's see some examples:

**From non-polite to polite**

*Input*: Send me the data.

*Output*: Could you please send me the data?


**From negative to positive**

*Input*: Their chips are ok, but their salsa is really bland.

*Output*: Their chips are great, but their salsa is really delicious.

In [None]:
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

Before starting, let's verify that we have a GPU available. If not, please change the runtime type: **Runtime -> Change runtime type -> Hardware accelerator -> GPU**.

In [None]:
!nvidia-smi

Wed May 24 08:07:53 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   56C    P8    12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Let's install the `transformers` library, along with `datasets`, `rouge_score`, and `accelerate`.

In [None]:
!pip install transformers
!pip install datasets
!pip install rouge_score
!pip install accelerate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting accelerate
  Downloading accelerate-0.19.0-py3-none-any.whl (219 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.19.0


# Read and split data

We will use the processed data from the paper mentioned above. It consists of roughly 265k paired samples of non-polite and polite sentences. The data is noisy, but that is not a problem: recent transformer models cope well with noisy inputs!

Let's download the dataset and see some examples.

In [None]:
!wget https://www.dropbox.com/s/cnsdfl5ndyp5q7e/politeness_data_2.zip

--2023-05-24 08:10:01--  https://www.dropbox.com/s/cnsdfl5ndyp5q7e/politeness_data_2.zip
Resolving www.dropbox.com (www.dropbox.com)... 162.125.6.18, 2620:100:6019:18::a27d:412
Connecting to www.dropbox.com (www.dropbox.com)|162.125.6.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/cnsdfl5ndyp5q7e/politeness_data_2.zip [following]
--2023-05-24 08:10:01--  https://www.dropbox.com/s/raw/cnsdfl5ndyp5q7e/politeness_data_2.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc8079f52546688d1c49632edeb0.dl.dropboxusercontent.com/cd/0/inline/B8pMI5JWTEfZbWyWzUhA1PPu6GBDLlNieOKccnxwRHJ9DFqzUmSqZN36bpM9DnaKgmYbVGXwRS1acqJhHJRKldUBEDWtNYGZwMWzhpaaSAHjV34sy0qpN8-40J8VOT0XqcCYbTwcFMskDyB1JLZwQHyxHjSPZlh-pNCa_MpcukEI5g/file# [following]
--2023-05-24 08:10:01--  https://uc8079f52546688d1c49632edeb0.dl.dropboxusercontent.com/cd/0/inline/B8pMI5JWTEfZbWyWzUhA1PPu6GBDLlNieOKccnxwRHJ9DFqzUmSqZ

In [None]:
!rm -r __MACOSX
!rm entagged_parallel.test.en entagged_parallel.dev.en entagged_parallel.train.en
!rm engenerated_parallel.dev.generated engenerated_parallel.test.generated engenerated_parallel.train.generated
!unzip politeness_data_2.zip

rm: cannot remove '__MACOSX': No such file or directory
Archive:  politeness_data_2.zip
  inflating: engenerated_parallel.dev.generated  
  inflating: engenerated_parallel.test.generated  
  inflating: engenerated_parallel.train.generated  
  inflating: entagged_parallel.test.en  
  inflating: entagged_parallel.train.en  
  inflating: entagged_parallel.dev.en  


In [None]:
!ls

dev.csv				      entagged_parallel.train.en
engenerated_parallel.dev.generated    politeness_data_2.zip
engenerated_parallel.test.generated   politeness_data_2.zip.1
engenerated_parallel.train.generated  sample_data
entagged_parallel.dev.en	      test.csv
entagged_parallel.test.en	      train.csv


Let's load the data and do some light preprocessing: we strip whitespace and skip empty lines.

In [None]:
# Let's load the data

data = {}

for split in ['train', 'dev', 'test']:
  data[split] = []

  # The inputs
  inputs = []
  with open('entagged_parallel.{}.en'.format(split), 'r') as fp:
    for line in fp:
      line = line.strip()
      if len(line) > 0:
        inputs.append(line)
  
  # The outputs
  outputs = []
  with open('engenerated_parallel.{}.generated'.format(split), 'r') as fp:
    for line in fp:
      line = line.strip()
      if len(line) > 0:
        outputs.append(line)
  
  data[split] = list(zip(inputs, outputs))
  print('{}: {} samples'.format(split, len(data[split])))

train: 212394 samples
dev: 26705 samples
test: 25790 samples
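
The loader above skips empty lines in each file independently, which could shift the pairing if only one side of a pair were empty. Here is a minimal sketch of an alignment-preserving variant, assuming the two files are line-aligned. The helper name `read_pairs` and the in-memory sample sentences are hypothetical and only serve as illustration.

```python
import io

def read_pairs(inputs_fp, outputs_fp):
    """Read two line-aligned files and keep only pairs where both sides are non-empty."""
    pairs = []
    for src, tgt in zip(inputs_fp, outputs_fp):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:  # drop the whole pair, never just one side
            pairs.append((src, tgt))
    return pairs

# Tiny in-memory stand-ins for the two files (hypothetical content)
sources = io.StringIO("send me [P_90] data .\n\nreply [P_91] .\n")
targets = io.StringIO("send me the data please .\nthanks .\nreply when you can .\n")
pairs = read_pairs(sources, targets)
print(pairs)
```

Because the empty source line removes its target as well, the third source stays paired with the third target.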


In [None]:
data['test'][10]

("it 's time to get an early [P_92] that [P_93] 's resolution .",
 "it 's time to get an early start on that new year 's resolution .")

As you can see, the dataset contains tags such as [P_92] or [P_93]. They help the model understand where to insert polite phrases and of what type. If you are interested in how those tags are generated, please refer to the [paper](https://arxiv.org/pdf/2004.14257.pdf).
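
To get a feel for the tags, here is a small sketch that locates them with a regular expression. The pattern `[P_<number>]` is inferred from the sample above, not taken from the paper, and the helper `find_tags` is hypothetical.

```python
import re

# Pattern inferred from the sample above: "[P_" + digits + "]"
TAG_PATTERN = re.compile(r"\[P_(\d+)\]")

def find_tags(sentence):
    """Return the tag numbers found in a sentence, in order of appearance."""
    return [int(n) for n in TAG_PATTERN.findall(sentence)]

sample = "it 's time to get an early [P_92] that [P_93] 's resolution ."
tags = find_tags(sample)
print(tags)
```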

To keep it simple with Huggingface, we will write the datasets into CSV files with three columns: a running index, the input (`nonpolite`), and the output (`polite`).

For simplicity, let's keep only 2500 samples for each of the training, validation, and test splits.

In [None]:
import csv

idx = 0
for split_data, samples in data.items():
  with open('{}.csv'.format(split_data), 'w', newline='') as csvfile:  # newline='' as recommended by the csv module
    writer = csv.writer(csvfile)
    writer.writerow(['idx', 'nonpolite', 'polite'])

    for sample in samples[:2500]:
      writer.writerow([idx] + list(sample))
      idx += 1
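
As a sanity check on the file layout, the sketch below writes a toy CSV with the same three-column header and reads it back with `csv.DictReader`. The file name `toy.csv` and the two sample rows are made up for illustration. It shows that commas and quotes inside sentences survive the round trip.

```python
import csv
import os
import tempfile

# Hypothetical rows in the same (idx, nonpolite, polite) layout as above
rows = [
    (0, "send me [P_90] data .", "could you please send me the data ."),
    (1, "reply [P_91] , with commas", 'she said "thanks" .'),
]

path = os.path.join(tempfile.mkdtemp(), "toy.csv")
with open(path, "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["idx", "nonpolite", "polite"])
    writer.writerows(rows)

# Read the file back; every value comes back as a string
with open(path, newline="") as csvfile:
    loaded = list(csv.DictReader(csvfile))

print(loaded[1]["nonpolite"])
```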

# Let's prepare the default parameters that we can use

In [None]:
from transformers import (
    AutoConfig,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    HfArgumentParser,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    set_seed,
)
import transformers
from datasets import load_dataset, load_metric


In [None]:
from transformers import TrainingArguments, HfArgumentParser
import logging
import os
import sys
from dataclasses import dataclass, field
from typing import Optional

We prepare the different parameters we can use with Huggingface and the model

In [None]:

@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """

    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None,
        metadata={"help": "Where to store the pretrained models downloaded from huggingface.co"},
    )
    use_fast_tokenizer: bool = field(
        default=True,
        metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
    )
    model_revision: str = field(
        default="main",
        metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
    )
    use_auth_token: bool = field(
        default=False,
        metadata={
            "help": "Will use the token generated when running `transformers-cli login` (necessary to use this script "
            "with private models)."
        },
    )
    resize_position_embeddings: Optional[bool] = field(
        default=None,
        metadata={
            "help": "Whether to automatically resize the position embeddings if `max_source_length` exceeds "
            "the model's position embeddings."
        },
    )


@dataclass
class DataTrainingArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    """

    lang: Optional[str] = field(default=None, metadata={"help": "Language id for summarization."})

    dataset_name: Optional[str] = field(
        default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
    )
    dataset_config_name: Optional[str] = field(
        default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
    )
    text_column: Optional[str] = field(
        default=None,
        metadata={"help": "The name of the column in the datasets containing the full texts (for summarization)."},
    )
    summary_column: Optional[str] = field(
        default=None,
        metadata={"help": "The name of the column in the datasets containing the summaries (for summarization)."},
    )
    train_file: Optional[str] = field(
        default='train.csv', 
        metadata={"help": "The input training data file (a jsonlines or csv file)."}
    )
    validation_file: Optional[str] = field(
        default='dev.csv',
        metadata={
            "help": "An optional input evaluation data file to evaluate the metrics (rouge) on "
            "(a jsonlines or csv file)."
        },
    )
    test_file: Optional[str] = field(
        default='test.csv',
        metadata={
            "help": "An optional input test data file to evaluate the metrics (rouge) on " "(a jsonlines or csv file)."
        },
    )
    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
    )
    preprocessing_num_workers: Optional[int] = field(
        default=None,
        metadata={"help": "The number of processes to use for the preprocessing."},
    )
    max_source_length: Optional[int] = field(
        default=64,
        metadata={
            "help": "The maximum total input sequence length after tokenization. Sequences longer "
            "than this will be truncated, sequences shorter will be padded."
        },
    )
    max_target_length: Optional[int] = field(
        default=64,
        metadata={
            "help": "The maximum total sequence length for target text after tokenization. Sequences longer "
            "than this will be truncated, sequences shorter will be padded."
        },
    )
    val_max_target_length: Optional[int] = field(
        default=None,
        metadata={
            "help": "The maximum total sequence length for validation target text after tokenization. Sequences longer "
            "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`. "
            "This argument is also used to override the ``max_length`` param of ``model.generate``, which is used "
            "during ``evaluate`` and ``predict``."
        },
    )
    pad_to_max_length: bool = field(
        default=False,
        metadata={
            "help": "Whether to pad all samples to model maximum sentence length. "
            "If False, will pad the samples dynamically when batching to the maximum length in the batch. More "
            "efficient on GPU but very bad for TPU."
        },
    )
    max_train_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": "For debugging purposes or quicker training, truncate the number of training examples to this "
            "value if set."
        },
    )
    max_eval_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
            "value if set."
        },
    )
    max_predict_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": "For debugging purposes or quicker training, truncate the number of prediction examples to this "
            "value if set."
        },
    )
    num_beams: Optional[int] = field(
        default=None,
        metadata={
            "help": "Number of beams to use for evaluation. This argument will be passed to ``model.generate``, "
            "which is used during ``evaluate`` and ``predict``."
        },
    )
    ignore_pad_token_for_loss: bool = field(
        default=True,
        metadata={
            "help": "Whether to ignore the tokens corresponding to padded labels in the loss computation or not."
        },
    )
    source_prefix: Optional[str] = field(
        default="", metadata={"help": "A prefix to add before every source text (useful for T5 models)."}
    )

    forced_bos_token: Optional[str] = field(
        default=None,
        metadata={
            "help": "The token to force as the first generated token after the decoder_start_token_id. "
            "Useful for multilingual models like mBART where the first generated token "
            "needs to be the target language token (Usually it is the target language token)"
        },
    )

    def __post_init__(self):
        # Fall back to max_target_length when no validation-specific limit is set,
        # as the help text of val_max_target_length states.
        if self.val_max_target_length is None:
            self.val_max_target_length = self.max_target_length

In [None]:
import sys

# Rebuild sys.argv from scratch so that re-running the cell does not pile up duplicate flags
sys.argv = [
    sys.argv[0],
    "--output_dir=.",
    "--model_name_or_path=facebook/bart-base",  # We specify our model here
    # "--model_name_or_path=t5-small",
    "--per_device_train_batch_size=64",
    "--per_device_eval_batch_size=64",
    "--predict_with_generate",
]

In [None]:
# We process the parameters
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, Seq2SeqTrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
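
To illustrate the idea behind this step, here is a simplified sketch of turning command-line flags into a dataclass instance using only the standard library. It is not how `HfArgumentParser` is implemented; `ToyArguments` and `parse_into_dataclass` are hypothetical names, and only two fields are kept for brevity.

```python
import argparse
from dataclasses import dataclass, field, fields

@dataclass
class ToyArguments:
    # Hypothetical, trimmed-down stand-in for the dataclasses defined above
    model_name_or_path: str = field(default="facebook/bart-base", metadata={"help": "Model identifier."})
    max_source_length: int = field(default=64, metadata={"help": "Maximum input length."})

def parse_into_dataclass(cls, argv):
    """Build one --flag per dataclass field, parse argv, and return an instance of cls."""
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        parser.add_argument(f"--{f.name}", type=f.type, default=f.default, help=f.metadata.get("help"))
    return cls(**vars(parser.parse_args(argv)))

toy_args = parse_into_dataclass(ToyArguments, ["--max_source_length=32"])
print(toy_args)
```

Flags that are not passed keep the default declared in the dataclass, which is why the notebook only needs to set a handful of values.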

In [None]:
# Setup logging
logger = logging.getLogger(__name__)

logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)
log_level = training_args.get_process_log_level()
logger.setLevel(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()

# Log on each process the small summary:
logger.warning(
    f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
    + f", distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
)
logger.info(f"Training/evaluation parameters {training_args}")




In [None]:
# Set seed before initializing model.
set_seed(training_args.seed)

# Load the data with huggingface


In [None]:
# Load the data for Huggingface

data_files = {
    "train": data_args.train_file,
    "validation": data_args.validation_file,
    "test": data_args.test_file,
}
extension = data_args.test_file.split(".")[-1]

raw_datasets = load_dataset(
    extension,
    data_files=data_files,
    cache_dir=model_args.cache_dir,
    use_auth_token=None,
    streaming=False,
)

Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-8c83e23eacc7041e/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-8c83e23eacc7041e/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

# Load pre-trained model

In [None]:
config = AutoConfig.from_pretrained(
    model_args.config_name if model_args.config_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
    use_fast=model_args.use_fast_tokenizer,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,
)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_args.model_name_or_path,
    from_tf=bool(".ckpt" in model_args.model_name_or_path),
    config=config,
    cache_dir=model_args.cache_dir,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,
)

model.resize_token_embeddings(len(tokenizer))


Downloading (…)lve/main/config.json:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/558M [00:00<?, ?B/s]

Embedding(50265, 768, padding_idx=1)

In [None]:
model

BartForConditionalGeneration(
  (model): BartModel(
    (shared): Embedding(50265, 768, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): Embedding(50265, 768, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 768)
      (layers): ModuleList(
        (0-5): 6 x BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=

# Prepare the data for our model (BART)

In [None]:
column_names = raw_datasets["train"].column_names

input_column = 'nonpolite'
output_column = 'polite'

# Temporarily set max_target_length for training.
max_target_length = data_args.max_target_length
padding = "max_length" if data_args.pad_to_max_length else False

In [None]:
def preprocess_function(examples):
    inputs, targets = [], []
    for i in range(len(examples[input_column])):
        if examples[input_column][i] is not None and examples[output_column][i] is not None:
            inputs.append(examples[input_column][i])
            targets.append(examples[output_column][i])

    model_inputs = tokenizer(inputs, max_length=data_args.max_source_length, padding=padding, truncation=True)

    # Tokenize the targets (text_target replaces the deprecated as_target_tokenizer context manager)
    labels = tokenizer(text_target=targets, max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length" and data_args.ignore_pad_token_for_loss:
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
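
The label-masking step in `preprocess_function` can be sketched on plain lists as follows. The token ids are made up; the padding id 1 matches the `padding_idx=1` shown in the model printout, and -100 is the label value that the loss ignores. The helper name `mask_padding` is hypothetical.

```python
PAD_TOKEN_ID = 1      # BART's padding index (padding_idx=1 in the model printout)
IGNORE_INDEX = -100   # label value skipped by the loss

def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding ids in each label sequence so they do not contribute to the loss."""
    return [[(tok if tok != pad_token_id else IGNORE_INDEX) for tok in seq] for seq in label_ids]

# Two already-padded label sequences with made-up token ids
padded_labels = [[0, 713, 2, 1, 1], [0, 100, 200, 300, 2]]
masked = mask_padding(padded_labels)
print(masked)
```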

We now tokenize our data for each split

In [None]:
# Run the tokenizer for the training
train_dataset = raw_datasets["train"]

with training_args.main_process_first(desc="train dataset map pre-processing"):
    train_dataset = train_dataset.map(
        preprocess_function,
        batched=True,
        num_proc=data_args.preprocessing_num_workers,
        remove_columns=column_names,
        load_from_cache_file=not data_args.overwrite_cache,
        desc="Running tokenizer on train dataset",
    )


Running tokenizer on train dataset:   0%|          | 0/2500 [00:00<?, ? examples/s]



In [None]:
# Run the tokenizer for the validation
max_target_length = data_args.val_max_target_length
eval_dataset = raw_datasets["validation"]

with training_args.main_process_first(desc="validation dataset map pre-processing"):
    eval_dataset = eval_dataset.map(
        preprocess_function,
        batched=True,
        num_proc=data_args.preprocessing_num_workers,
        remove_columns=column_names,
        load_from_cache_file=not data_args.overwrite_cache,
        desc="Running tokenizer on validation dataset",
    )

Running tokenizer on validation dataset:   0%|          | 0/2500 [00:00<?, ? examples/s]

In [None]:
max_target_length = data_args.val_max_target_length
predict_dataset = raw_datasets["test"]

with training_args.main_process_first(desc="prediction dataset map pre-processing"):
    predict_dataset = predict_dataset.map(
        preprocess_function,
        batched=True,
        num_proc=data_args.preprocessing_num_workers,
        remove_columns=column_names,
        load_from_cache_file=not data_args.overwrite_cache,
        desc="Running tokenizer on prediction dataset",
    )

Running tokenizer on prediction dataset:   0%|          | 0/2500 [00:00<?, ? examples/s]

Next, we define how individual samples are assembled into batches. We use the `DataCollatorForSeq2Seq` class for that.

In [None]:
# Data collator
label_pad_token_id = -100 if data_args.ignore_pad_token_for_loss else tokenizer.pad_token_id
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8 if training_args.fp16 else None,
)
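
To make the batching step concrete, here is a toy sketch of dynamic padding on plain Python lists: inputs are padded with the padding id, labels with -100, and lengths can optionally be rounded up to a multiple of 8. The function `collate` and the token ids are hypothetical; the real `DataCollatorForSeq2Seq` works on tensors and does more than this.

```python
import math

def collate(batch, pad_token_id=1, label_pad_token_id=-100, pad_to_multiple_of=None):
    """Pad input_ids and labels to the longest sequence in the batch (toy version, plain lists)."""
    def target_len(seqs):
        longest = max(len(s) for s in seqs)
        if pad_to_multiple_of:
            longest = math.ceil(longest / pad_to_multiple_of) * pad_to_multiple_of
        return longest

    in_len = target_len([ex["input_ids"] for ex in batch])
    lab_len = target_len([ex["labels"] for ex in batch])
    return {
        "input_ids": [ex["input_ids"] + [pad_token_id] * (in_len - len(ex["input_ids"])) for ex in batch],
        "attention_mask": [[1] * len(ex["input_ids"]) + [0] * (in_len - len(ex["input_ids"])) for ex in batch],
        "labels": [ex["labels"] + [label_pad_token_id] * (lab_len - len(ex["labels"])) for ex in batch],
    }

# Two samples of different lengths with made-up token ids
batch = [
    {"input_ids": [0, 11, 12, 2], "labels": [0, 21, 22, 23, 2]},
    {"input_ids": [0, 13, 2], "labels": [0, 24, 2]},
]
collated = collate(batch)
print(collated["input_ids"])
print(collated["labels"])
```

Padding only to the longest sequence in each batch, rather than to a fixed maximum, avoids wasting computation on padding tokens.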

# Prepare the metric and its computation

In [None]:
import nltk
import numpy as np

# Metric
metric = load_metric("rouge")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

    return preds, labels
    
def compute_metrics(eval_preds):
    preds, labels = eval_preds

    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    if data_args.ignore_pad_token_for_loss:
        # Replace -100 in the labels as we can't decode them.
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results from ROUGE
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result



Downloading builder script:   0%|          | 0.00/2.17k [00:00<?, ?B/s]
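
To build some intuition for what ROUGE-1 measures, here is a simplified sketch of the unigram-overlap F-measure in pure Python. It assumes whitespace tokenization and no stemming, so its scores will not exactly match the `rouge` metric loaded above. The function name `rouge1_f` is hypothetical.

```python
from collections import Counter

def rouge1_f(prediction, reference):
    """Unigram-overlap F-measure between two whitespace-tokenized strings."""
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("could you send me the data", "could you please send me the data")
print(round(score * 100, 2))
```

Here the prediction misses only the word "please", so precision is perfect while recall is 6 out of 7.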

# Train!

In [None]:
# Initialize our Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [None]:
model.device

device(type='cuda', index=0)

In [None]:
# Training
checkpoint = None
last_checkpoint = None

if training_args.resume_from_checkpoint is not None:
    checkpoint = training_args.resume_from_checkpoint
elif last_checkpoint is not None:
    checkpoint = last_checkpoint
train_result = trainer.train(resume_from_checkpoint=checkpoint)
trainer.save_model()  # Saves the tokenizer too for easy upload

metrics = train_result.metrics
max_train_samples = (
    data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset)
)
metrics["train_samples"] = min(max_train_samples, len(train_dataset))

trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()
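
In the cell above, `last_checkpoint` is always `None`, so training starts from scratch unless `resume_from_checkpoint` is set. As a hedged sketch, the latest checkpoint could be looked up as below, assuming checkpoints are stored in subfolders named `checkpoint-<step>`. The helper `find_last_checkpoint` is hypothetical and uses only the standard library.

```python
import os
import re
import tempfile

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")

def find_last_checkpoint(output_dir):
    """Return the path of the checkpoint-<step> subfolder with the highest step, or None."""
    candidates = []
    for name in os.listdir(output_dir):
        match = CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    # Compare steps as integers, so that 19500 ranks above 500
    return os.path.join(output_dir, max(candidates)[1])

# Toy output directory with two fake checkpoints
demo_dir = tempfile.mkdtemp()
for step in (500, 19500):
    os.makedirs(os.path.join(demo_dir, f"checkpoint-{step}"))
found = find_last_checkpoint(demo_dir)
print(os.path.basename(found))
```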





Let's download and load our trained model.

In [None]:
!wget https://www.dropbox.com/s/w2li5om9idwzwjb/checkpoint-19500.zip
!unzip checkpoint-19500.zip

--2023-05-24 08:12:24--  https://www.dropbox.com/s/w2li5om9idwzwjb/checkpoint-19500.zip
Resolving www.dropbox.com (www.dropbox.com)... 162.125.80.18, 2620:100:6019:18::a27d:412
Connecting to www.dropbox.com (www.dropbox.com)|162.125.80.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/w2li5om9idwzwjb/checkpoint-19500.zip [following]
--2023-05-24 08:12:25--  https://www.dropbox.com/s/raw/w2li5om9idwzwjb/checkpoint-19500.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc8ab39ec3209cfea82056757c3f.dl.dropboxusercontent.com/cd/0/inline/B8rfon-fkGLYdmb0m6huVHnx-nrqkMdDYRmi44JSs8oLaLdr0UC0oCFbh9aspAPDlOiKDjG07Z3DHgWNwouoiaR_ve2k3d4OTsChP1x0bXePXIwxTv9H9cOcId0iYFpt5eQf45JqPftbEcvjunvc5t55E8MuTKFWL3Di68gujhfzPg/file# [following]
--2023-05-24 08:12:27--  https://uc8ab39ec3209cfea82056757c3f.dl.dropboxusercontent.com/cd/0/inline/B8rfon-fkGLYdmb0m6huVHnx-nrqkMdDYRmi44JSs8oLaLdr0UC0oC

In [None]:
# We change the parameter and load the model
import sys
import nltk
nltk.download('punkt')

# Rebuild sys.argv from scratch so that re-running the cell does not pile up duplicate flags
sys.argv = [
    sys.argv[0],
    "--output_dir=.",
    "--model_name_or_path=checkpoint-19500",  # We specify our model PATH here!
    "--per_device_eval_batch_size=64",
    "--predict_with_generate",
]

# We process the parameters
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, Seq2SeqTrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()


config = AutoConfig.from_pretrained(
    model_args.config_name if model_args.config_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
    use_fast=model_args.use_fast_tokenizer,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,
)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_args.model_name_or_path,
    from_tf=bool(".ckpt" in model_args.model_name_or_path),
    config=config,
    cache_dir=model_args.cache_dir,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,
)

model.resize_token_embeddings(len(tokenizer))

# Initialize our Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


# Evaluation on the validation set

In [None]:
# Evaluation
results = {}
max_length = (
    training_args.generation_max_length
    if training_args.generation_max_length is not None
    else data_args.val_max_target_length
)
num_beams = data_args.num_beams if data_args.num_beams is not None else training_args.generation_num_beams

logger.info("*** Evaluate ***")
metrics = trainer.evaluate(max_length=max_length, num_beams=num_beams, metric_key_prefix="eval")
max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset)
metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))

trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)

***** eval metrics *****
  eval_gen_len            =    18.1296
  eval_loss               =     0.4163
  eval_rouge1             =    73.3844
  eval_rouge2             =     63.909
  eval_rougeL             =    72.9543
  eval_rougeLsum          =    72.9553
  eval_runtime            = 0:01:13.35
  eval_samples            =       2500
  eval_samples_per_second =     34.081
  eval_steps_per_second   =      0.545


# Generate our new polite emails! (Test set)

In [None]:
logger.info("*** Predict ***")

predict_results = trainer.predict(
    predict_dataset, metric_key_prefix="predict", max_length=max_length, num_beams=num_beams
)
metrics = predict_results.metrics
max_predict_samples = (
    data_args.max_predict_samples if data_args.max_predict_samples is not None else len(predict_dataset)
)
metrics["predict_samples"] = min(max_predict_samples, len(predict_dataset))

trainer.log_metrics("predict", metrics)
trainer.save_metrics("predict", metrics)

predictions = tokenizer.batch_decode(
    predict_results.predictions, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
predictions = [pred.strip() for pred in predictions]
output_prediction_file = os.path.join(training_args.output_dir, "generated_predictions.txt")
with open(output_prediction_file, "w") as writer:
    writer.write("\n".join(predictions))

***** predict metrics *****
  predict_gen_len            =    18.1064
  predict_loss               =     0.4033
  predict_rouge1             =     73.799
  predict_rouge2             =    64.6458
  predict_rougeL             =    73.3562
  predict_rougeLsum          =    73.3531
  predict_runtime            = 0:01:16.73
  predict_samples            =       2500
  predict_samples_per_second =     32.579
  predict_steps_per_second   =      0.521


# Let's see some outputs

In [None]:
raw_datasets["test"][0]

{'idx': 5000,
 'nonpolite': 'i do not [P_91] issues and [P_93] or possibly late this afternoon .',
 'polite': 'i do not think there will be any issues and should have it ready by tomorrow or possibly late this afternoon .'}

In [None]:
golds = []
for gold in predict_results.label_ids:
  gold = [token_id for token_id in gold if token_id != -100] # Remove padding
  gold = tokenizer.decode(gold, skip_special_tokens=True)
  golds.append(gold)

In [None]:
pred_gold = list(zip(predictions, golds, raw_datasets["test"]))

for pred, gold, sample in pred_gold[:10]:
  print('Input: {}'.format(sample['nonpolite']))
  print('Pred : {}'.format(pred))
  print('Gold : {}'.format(gold))
  print()

Input: i do not [P_91] issues and [P_93] or possibly late this afternoon .
Pred : i do not expect to have any issues and will call you tomorrow or possibly late this
Gold : i do not think there will be any issues and should have it ready by tomorrow or possibly late this afternoon.

Input: [P_90] concord [P_91] for 90,000dth between niagara and leidy , 100,000 dth .
Pred : we will concord with your request for 90,000dth between niagara and
Gold : it through concord is available for 90,000dth between niagara and leidy, 100,000 dth.

Input: we are posting 50,000 dth excess injection [P_92] or west [P_94] we have already seen interest from shippers with primary rights to use that space .
Pred : we are posting 50,000 dth excess injection capacity in east or west and we
Gold : we are posting 50,000 dth excess injection on either the east or west side but we have already seen interest from shippers with primary rights to use that space.

Input: i [P_90] to remind you that our firm transport 