# READ THIS BEFORE RUNNING THE CODE

If you are running this notebook in **Colab**, set up a working directory in your Google Drive that contains the BERT checkpoints (e.g., 'distilbert-base-cased-train-on-se-checkpoint-867') and the following datasets:

* **StackExchange politeness dataset** (e.g., 'dataset_se_bin_politeness.csv')
* **Wikipedia politeness dataset** (e.g., 'dataset_wk_bin_politeness.csv')
* **Slack dataset** (e.g., 'dataset_slack.csv')


Running in a **GPU** environment is recommended (Runtime -> Change runtime type -> Hardware accelerator).
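
To confirm that the runtime actually has a GPU attached, a quick check like the one below can help. It is a minimal sketch that assumes PyTorch is the backend for the checkpoints; the helper name `gpu_available` is hypothetical and not part of the original notebook.

```python
def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch  # assumption: PyTorch is the backend used by the checkpoints
    except ImportError:
        # Without PyTorch there is nothing to run on the GPU.
        return False
    return torch.cuda.is_available()


print('GPU available:', gpu_available())
```

If this prints `False` in Colab, the hardware accelerator is probably still set to "None".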

Example file structure:

<img src="https://raw.githubusercontent.com/sherl9/Language-Power-of-Workplace-Conversation/main/assets/ftree.png" alt="ftree" width="300"/>
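
Before running the rest of the notebook, it may be worth checking that the working directory really contains the expected entries. The sketch below uses the example file and checkpoint names mentioned above; the helper `missing_entries` and the `REQUIRED` list are hypothetical and should be adapted to your own setup.

```python
from pathlib import Path

# Example names taken from this notebook; adjust them to your own files.
REQUIRED = [
    'dataset_se_bin_politeness.csv',
    'dataset_wk_bin_politeness.csv',
    'dataset_slack.csv',
    'distilbert-base-cased-train-on-se-checkpoint-867',
]


def missing_entries(work_dir, required=REQUIRED):
    """Return the names in `required` that do not exist under `work_dir`."""
    root = Path(work_dir)
    return [name for name in required if not (root / name).exists()]


# An empty list means every expected file and directory was found.
print(missing_entries('.'))
```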

# Define user variables

* @POL_DIR: name of the working directory where you store the datasets and checkpoints
* @CKPT_DIR: name of the checkpoint directory  
* @POL_LABEL_FILE: file where you want to store the predicted politeness labels
* @POL_SCORE_FILE: file where you want to store the predicted politeness scores
* @TRAIN, TEST: set `TRAIN` to 'se' and `TEST` to 'wk' to use the BERT model trained on SE (swap the two values for the model trained on WK). This setting must be **aligned** with your choice of checkpoint
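
The alignment between `TRAIN` and the checkpoint can be checked mechanically. The following is a minimal sketch, assuming checkpoint directories follow the `...-train-on-<domain>-checkpoint-...` naming pattern of the examples in this notebook; the helper `checkpoint_matches_train` is hypothetical.

```python
def checkpoint_matches_train(ckpt_dir: str, train: str) -> bool:
    """Return True if `ckpt_dir` contains the tag 'train-on-<train>'.

    Assumption: checkpoint names embed the training domain as
    'train-on-se' or 'train-on-wk', as in this notebook's examples.
    """
    return f'train-on-{train}' in ckpt_dir


CKPT_DIR = 'distilbert-base-cased-train-on-se-checkpoint-867'
TRAIN, TEST = 'se', 'wk'

if not checkpoint_matches_train(CKPT_DIR, TRAIN):
    raise ValueError(f"CKPT_DIR '{CKPT_DIR}' does not look like a model trained on '{TRAIN}'")
print('Checkpoint and TRAIN are aligned')
```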

In [1]:
POL_DIR  = 'drive/MyDrive/politeness-bert/'

CKPT_DIR = 'distilbert-base-cased-train-on-se-checkpoint-867' # BERT model trained on SE
# CKPT_DIR = 'roberta-large-train-on-wk-checkpoint-573' # BERT model trained on WK

POL_LABEL_FILE = 'predictions_slack_labels.txt'
POL_SCORE_FILE = 'predictions_slack_scores.txt'

TRAIN, TEST = 'se', 'wk'

# Connect to drive folder

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%cd 'drive/MyDrive/politeness-bert-test/'

/content/drive/MyDrive/politeness-bert-test


# Install essential packages

In [4]:
!pip install datasets
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.6.1-py3-none-any.whl (441 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py37-none-any.whl (115 kB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting dill<0.3.6
  Downloading dill-0.3.5.1-py2.py3-none-any.whl (95 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.

# Import packages

In [5]:
import numpy as np
from datasets import load_dataset, load_metric, DatasetDict
from transformers import AutoTokenizer
from transformers import TrainingArguments, Trainer
from transformers import AutoModelForSequenceClassification

# Load politeness datasets

In [6]:
train_file = 'dataset_' + TRAIN + '_bin_politeness.csv'
test_file  = 'dataset_' + TEST + '_bin_politeness.csv'

dataset = load_dataset('csv', data_files={'train': train_file, 'test': test_file})

# Rename dataset columns
dataset = dataset.rename_column('Request', 'text')
dataset = dataset.rename_column('Politeness', 'label')

# Split the train dataset into train (70%) and validation (30%)
SEED = 99
train_valid = dataset['train'].shuffle(SEED).train_test_split(test_size=0.3, seed=SEED)
dataset_split = DatasetDict({
    'train': train_valid['train'],
    'valid': train_valid['test'],
    'test': dataset['test']})

# print(dataset_split)



Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-7091a571c50b4af8/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-7091a571c50b4af8/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.
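
To make the split proportions concrete, here is a minimal sketch of a seeded 70/30 split on a plain Python list. It only mirrors the idea behind `shuffle` followed by `train_test_split(test_size=0.3)`; it does not reproduce the exact shuffling or rounding of the `datasets` library, and the helper `split_train_valid` is hypothetical.

```python
import random


def split_train_valid(rows, seed=99, valid_fraction=0.3):
    """Shuffle `rows` with a fixed seed and split off `valid_fraction` for validation."""
    rng = random.Random(seed)
    shuffled = list(rows)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_valid = int(round(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


train_rows, valid_rows = split_train_valid(range(10))
print(len(train_rows), len(valid_rows))  # 7 3
```

Using the same seed on every run keeps the train and validation subsets identical across runs.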



# Verify Cross-Domain Accuracy



In [7]:
checkpoint = CKPT_DIR

# Hyperparameters
RUNTIME = 0
NUM_EPOCHS = 3
BATCH_SIZE = 8
LR = 2e-5
OUTPUT_DIR = f'runtime/run{RUNTIME}'

# Evaluation metrics
metric = load_metric('accuracy')

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=True)
def my_preprocess(examples):
    return tokenizer(examples['text'], truncation=True)
dataset_split_encoded = dataset_split.map(my_preprocess, batched=True)

# The model
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Trainer
args = TrainingArguments(
    output_dir = OUTPUT_DIR,
    evaluation_strategy = 'epoch',
    logging_strategy = 'epoch',
    save_strategy = 'epoch',
    learning_rate = LR,
    per_device_train_batch_size = BATCH_SIZE,
    per_device_eval_batch_size = BATCH_SIZE,
    num_train_epochs = NUM_EPOCHS,
    load_best_model_at_end = True,
    metric_for_best_model = 'accuracy'
)

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model_init = model_init,
    args = args,
    train_dataset = dataset_split_encoded['train'].shuffle(SEED),
    eval_dataset = dataset_split_encoded['valid'].shuffle(SEED),
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)


loading configuration file distilbert-base-cased-train-on-se-checkpoint-867/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-cased-train-on-se-checkpoint-867",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.24.0",
  "vocab_size": 28996
}

loading weights file distilbert-base-cased-train-on-se-checkpoint-867/pytorch_model.bin
All model checkpoint weights were used when initializing DistilBertForSequenceClassification.

All the weights of DistilBertForSeq

In [8]:
# Verify accuracy on the test set (the domain the checkpoint was not trained on)
dataset_test_encoded = dataset_split['test'].map(my_preprocess, batched=True)
# trainer.predict returns a PredictionOutput; element 0 holds the raw logits
test_logits = trainer.predict(dataset_test_encoded)
test_predictions = np.argmax(test_logits[0], axis=1)
test_references = np.array(dataset_test_encoded['label'])
test_acc = metric.compute(predictions=test_predictions, references=test_references)

print('##########################################################################################')
print(test_acc)
print('##########################################################################################')

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2176
  Batch size = 8
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


##########################################################################################
{'accuracy': 0.8216911764705882}
##########################################################################################
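
The accuracy above is the share of test examples whose highest-scoring class matches the reference label. A minimal NumPy sketch of that computation follows; the toy logits and labels are made up for illustration and do not come from the datasets, and `accuracy_from_logits` is a hypothetical helper.

```python
import numpy as np


def accuracy_from_logits(logits, labels):
    """Pick the arg-max class per row and compare it with the reference labels."""
    predictions = np.argmax(logits, axis=1)
    return float(np.mean(predictions == np.asarray(labels)))


# Toy values for illustration only: 4 examples, 2 classes.
toy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]])
toy_labels = [0, 1, 0, 0]
print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```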


# Predict on Slack

In [9]:
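# Load slack dataset (a single CSV passed to load_dataset is exposed as the 'train' split)
dataset_sk = load_dataset('csv', data_files='dataset_slack.csv')
dataset_sk_encoded = dataset_sk['train'].map(my_preprocess, batched=True)
# trainer.predict returns a PredictionOutput; element 0 holds the raw logits
sk_logits = trainer.predict(dataset_sk_encoded)
sk_predictions = np.argmax(sk_logits[0], axis=1)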



Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-fa102b759cce16b5/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-fa102b759cce16b5/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.



The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 612
  Batch size = 8


# Save Predictions

In [10]:
# Save predicted politeness labels
with open(POL_LABEL_FILE, 'w') as f:
    for i in sk_predictions:
        f.write(str(i)+'\n')

# Save predicted politeness scores
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

pol_score_lst = []
for x in sk_logits[0]:
    pol_score_lst.append(softmax(x)[1])

with open(POL_SCORE_FILE, 'w') as f:
    for i in pol_score_lst:
        f.write(str(i)+'\n')
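
The per-example loop above can also be written as a single vectorized operation. The sketch below is one possible equivalent, using a hypothetical helper `softmax_rows`; as in the cell above, the probability of class index 1 is taken as the politeness score, and the toy logits are made up for illustration.

```python
import numpy as np


def softmax_rows(logits):
    """Apply a numerically stable softmax to each row of a 2-D logits array."""
    logits = np.asarray(logits, dtype=float)
    # Subtracting the row maximum avoids overflow in exp without changing the result.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Toy logits for illustration only: 2 examples, 2 classes.
toy_logits = np.array([[0.0, 0.0], [1.0, 3.0]])
scores = softmax_rows(toy_logits)[:, 1]  # probability of class index 1
print(scores)
```

Each row of `softmax_rows(...)` sums to 1, so the score can be read as the model's estimated probability for class 1.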