# Recognizing Helpful Answers

Part 2 has you develop classifiers using the Hugging Face library. Modern NLP has been driven by advances in Large Language Models (LLMs), which, as we discussed in Week 3, learn to predict word sequences. A substantial body of research has shown that once these models learn to “recognize language,” their parameters (the weights of the neural network) can quickly be adapted to accomplish many NLP tasks.
For this assignment we will use a much smaller but nearly as performant [version of BERT](https://huggingface.co/microsoft/MiniLM-L12-H384-uncased) to train our models.

## Load Data

In [7]:
#!pip install -r requirements.txt

In [12]:
from datasets import load_dataset, Dataset  # load_metric is unused here and is no longer available in recent datasets releases
from transformers import BertTokenizer, BertForSequenceClassification, EarlyStoppingCallback
from transformers import Trainer, TrainingArguments, EvalPrediction
import pandas as pd
import torch
import wandb
import os
from pathlib import Path
torch.cuda.empty_cache()

In [2]:
train_data = pd.read_csv('data/si630w22-hw3-train.csv')
dev_data = pd.read_csv('data/si630w22-hw3-dev.csv')
test_data = pd.read_csv('data/si630w22-hw3-test.public.csv')
q_and_a_data = pd.read_csv('data/si630w22-hw3-data.csv')

## Preprocessing

In [3]:
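# combine the different annotators' ratings into one mean rating per id
# (select the numeric rating column explicitly; string columns cannot be averaged)
train_df = train_data.groupby('id', as_index=False)['rating'].mean()
dev_df = dev_data.groupby('id', as_index=False)['rating'].mean()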

In [4]:
train_data.head() # we need to concat the annotator and group ids as well.

| | id | annotator_id | rating | group |
|---|---|---|---|---|
| 0 | t3_n27vu3 | user_00 | 5.0 | group_09 |
| 1 | t3_n27vu3 | user_01 | 5.0 | group_09 |
| 2 | t3_n27vu3 | user_02 | 5.0 | group_09 |
| 3 | t3_n2az7m | user_00 | 5.0 | group_09 |
| 4 | t3_n2az7m | user_01 | 5.0 | group_09 |
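
Each question id above is rated by several annotators, so the ratings have to be collapsed to one label per id before training. The following is a minimal sketch of that aggregation on made-up rows; the ids `q1` and `q2` and their ratings are illustrative assumptions, not values from the assignment data.

```python
import pandas as pd

# Toy annotations: hypothetical ids and ratings, three annotators per question
toy = pd.DataFrame({
    "id": ["q1", "q1", "q1", "q2", "q2", "q2"],
    "annotator_id": ["user_00", "user_01", "user_02"] * 2,
    "rating": [5.0, 4.0, 3.0, 2.0, 2.0, 5.0],
})

# Average only the numeric rating column; string columns such as
# annotator_id cannot be averaged
toy_mean = toy.groupby("id", as_index=False)["rating"].mean()
print(toy_mean)
```

The result has one row per id, with the mean rating serving as the regression target.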


In [5]:
test_data.head()

| | id | annotator_id | group |
|---|---|---|---|
| 0 | t3_n2ooiu | user_00 | group_09 |
| 1 | t3_n2ooiu | user_01 | group_09 |
| 2 | t3_n2ooiu | user_02 | group_09 |
| 3 | t3_n2to6d | user_00 | group_09 |
| 4 | t3_n2to6d | user_01 | group_09 |


In [6]:
train_df.rename(columns={'id':'question_id'}, inplace = True)
dev_df.rename(columns={'id':'question_id'}, inplace = True)
test_data.rename(columns={'id':'question_id'}, inplace = True)
train_df.rename(columns={'rating':'labels'}, inplace = True)
dev_df.rename(columns={'rating':'labels'}, inplace = True)

# train_df['labels'] = train_df.labels.apply(lambda x: [x])
# dev_df['labels'] = dev_df.labels.apply(lambda x: [x])

q_and_a_data['text'] = q_and_a_data.question_text + '[SEP]' + q_and_a_data.reply_text

merged_train_data = pd.merge(train_df,q_and_a_data[['text','question_id']], on='question_id', how='left')
merged_train_data.dropna(subset=['labels'], inplace = True)
merged_train_data.drop(columns=['question_id'], inplace = True)
merged_train_data.to_csv('data/merged_train_data3.csv', index=False)

merged_dev_data = pd.merge(dev_df, q_and_a_data[['text','question_id']], on='question_id', how='left')
merged_dev_data.dropna(subset=['labels'], inplace = True)
merged_dev_data.drop(columns=['question_id'], inplace = True)
merged_dev_data.to_csv('data/merged_dev_data3.csv', index=False)

merged_test_data = pd.merge(q_and_a_data[['text','question_id']], test_data, on='question_id', how='inner')
merged_test_data.drop_duplicates(subset=['question_id'],inplace=True)
merged_test_data.drop(columns=['annotator_id','group'], inplace = True)
merged_test_data.to_csv('data/merged_test_data3.csv', index=False)
merged_test_data.head()

| | text | question_id |
|---|---|---|
| 0 | Should I be insecure of my penis size?[SEP]Tru... | t3_n2ooiu |
| 3 | how do you guys lay out for monthly AND weekly... | t3_n2to6d |
| 8 | What business could you realistic start with $... | t3_n2xuk5 |
| 13 | Has anyone ever actually been in a classic hor... | t3_n2xrc5 |
| 18 | If you could go back in time and change one th... | t3_n2yhgh |
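
The test file has one row per annotator, so the inner merge repeats each text several times before `drop_duplicates` keeps a single row per question. That is likely why the index in the output above has gaps. A small sketch of the same pattern on hypothetical frames (the ids and texts are made up for illustration):

```python
import pandas as pd

# Hypothetical question/reply texts keyed by question_id
qa = pd.DataFrame({
    "question_id": ["q1", "q2"],
    "text": ["Question one?[SEP]Reply one", "Question two?[SEP]Reply two"],
})
# Hypothetical test annotations: one row per (question, annotator) pair
annotations = pd.DataFrame({
    "question_id": ["q1", "q1", "q1", "q2", "q2"],
    "annotator_id": ["user_00", "user_01", "user_02", "user_00", "user_01"],
})

# The inner merge repeats each text once per annotator row
merged = pd.merge(qa, annotations, on="question_id", how="inner")
# Keep the first row for every question so each text is predicted once
unique = merged.drop_duplicates(subset=["question_id"]).drop(columns=["annotator_id"])
print(unique)
```

The surviving rows keep their original index labels, so calling `reset_index(drop=True)` afterwards is an option if a contiguous index is preferred.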


## Model Development & Training

Develop your code using Hugging Face’s `Trainer` class to train a classifier or regressor that predicts the helpfulness rating of an answer.

In [7]:
# the merged CSVs were written to the data/ directory in the preprocessing step
data_files = {"train": "data/merged_train_data3.csv",
              "dev": "data/merged_dev_data3.csv"}
test_dataset = load_dataset("csv", data_files={"test": "data/merged_test_data3.csv"})
dataset = load_dataset("csv", data_files=data_files)
dataset


DatasetDict({
    train: Dataset({
        features: ['labels', 'text'],
        num_rows: 3779
    })
    dev: Dataset({
        features: ['labels', 'text'],
        num_rows: 811
    })
})

In [8]:
test_dataset

DatasetDict({
    test: Dataset({
        features: ['text', 'question_id'],
        num_rows: 810
    })
})

In [9]:
# model = BertForSequenceClassification.from_pretrained("microsoft/MiniLM-L12-H384-uncased",problem_type="multi_label_classification",num_labels=5)
# num_labels=1 gives a single-output regression head; the default num_labels=2 is binary classification
model = BertForSequenceClassification.from_pretrained("microsoft/MiniLM-L12-H384-uncased", num_labels=1)
# padding, truncation, and max_length are set when the tokenizer is called on the text, not when it is loaded
tokenizer = BertTokenizer.from_pretrained("microsoft/MiniLM-L12-H384-uncased")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at microsoft/MiniLM-L12-H384-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
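
The warning above refers to the task head: `classifier.weight` and `classifier.bias` are not part of the pretrained checkpoint and start from random values. With `num_labels=1` the head maps the pooled sentence vector to a single number, and the model is trained with a mean-squared-error loss. The following is a rough NumPy sketch of that idea, not the library's code. It assumes a hidden size of 384 (the H384 in the checkpoint name), uses random stand-in values instead of real activations, and leaves out dropout.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 384   # assumed from "H384" in the checkpoint name
batch_size = 4

# Stand-in for the pooled [CLS] vectors the encoder would produce
pooled = rng.normal(size=(batch_size, hidden_size))

# Freshly initialized head: one weight column and one bias (num_labels=1)
weight = rng.normal(scale=0.02, size=(hidden_size, 1))
bias = np.zeros(1)

# One real-valued prediction per example, shape (batch_size, 1)
logits = pooled @ weight + bias

# Regression loss: mean squared error against made-up mean ratings
labels = np.array([5.0, 4.0, 3.5, 2.0])
loss = np.mean((logits.reshape(-1) - labels) ** 2)
print(logits.shape, float(loss))
```

Because the head starts from random weights, the initial predictions are unrelated to the ratings, which is what fine-tuning is meant to correct.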



In [10]:
tokenized_train_dataset = dataset['train'].map(lambda x: tokenizer(x['text'],padding = 'max_length', max_length =512, truncation=True))
tokenized_dev_dataset = dataset['dev'].map(lambda x: tokenizer(x['text'],padding = 'max_length', max_length =512, truncation=True))
tokenized_test_dataset = test_dataset['test'].map(lambda x: tokenizer(x['text'],padding = 'max_length', max_length =512, truncation=True))


In [11]:
tokenized_train_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask','labels'])
tokenized_dev_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask','labels'])
tokenized_test_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask'])

In [12]:
tokenized_train_dataset

Dataset({
    features: ['labels', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 3779
})
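
The `input_ids` and `attention_mask` columns come from the tokenizer call, which pads or truncates every example to `max_length`. The toy function below illustrates only that fixed-length step. The function name, the token ids, and the pad id of 0 are assumptions for illustration, and it ignores special-token handling, so it is not how the Hugging Face tokenizer is implemented.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)                   # padding needed, if any
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short example is padded, a long one is cut off
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every example to 512 tokens keeps the tensors uniform but means each batch has the maximum length; padding per batch (for instance with `DataCollatorWithPadding`) is a common alternative when most texts are short.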

In [13]:
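from sklearn.metrics import mean_squared_error
import numpy as np
def compute_metrics(pred):
    # pred is the EvalPrediction that Trainer passes in after each evaluation
    labels = pred.label_ids                 # gold mean ratings, shape (n_examples,)
    preds = pred.predictions.reshape(-1)    # model outputs, flattened from (n_examples, 1)
    mse = mean_squared_error(labels, preds)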
    return {
        'mse': mse
    }
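
`compute_metrics` receives model outputs of shape `(n_examples, 1)` and gold labels of shape `(n_examples,)`, so the predictions are flattened before the two are compared. The following is a small check of that logic with made-up numbers. It uses `types.SimpleNamespace` as a stand-in for the object `Trainer` passes in, NumPy in place of scikit-learn, and a hypothetical helper name `compute_mse`.

```python
from types import SimpleNamespace
import numpy as np

def compute_mse(pred):
    labels = pred.label_ids
    preds = pred.predictions.reshape(-1)   # (n, 1) -> (n,)
    return {"mse": float(np.mean((labels - preds) ** 2))}

# Made-up predictions and labels, not results from the assignment
fake = SimpleNamespace(
    predictions=np.array([[4.0], [3.0], [5.0]]),
    label_ids=np.array([5.0, 3.0, 3.0]),
)
metrics = compute_mse(fake)
print(metrics)

# Without the reshape, (3,) - (3, 1) broadcasts to a (3, 3) matrix
broadcast_shape = (fake.label_ids - fake.predictions).shape
print(broadcast_shape)
```

The squared errors here are 1, 0, and 4, so the mean is 5/3. The broadcast check shows why flattening matters when the arithmetic is done by hand in NumPy.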

In [14]:
training_args = TrainingArguments(
    output_dir = 'BERTSeq',
    num_train_epochs = 3,
    evaluation_strategy = 'steps',
    eval_steps = 500,
    learning_rate=1e-4,
    per_device_train_batch_size = 8,
    per_device_eval_batch_size = 8,
    seed =0,
    load_best_model_at_end = True
#     do_train = True,
#     do_eval = True,
#     logging_strategy = 'epoch',
#     metric_for_best_model = 'eval_loss',
#     warmup_steps = 250,
#     weight_decay = 0.01,
)

In [15]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_dev_dataset,
    compute_metrics=compute_metrics,
#     compute_mse(tokenized_dataset['dev']['labels'],tokenized_dataset['dev']['labels']),
#     data_collator=data_collator,
#     tokenizer=tokenizer
    callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
)
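
`EarlyStoppingCallback(early_stopping_patience=3)` stops training once the monitored metric has failed to improve for three evaluations in a row. The function below is a simplified sketch of that patience rule, not the library's implementation. `first_stop_index` is a hypothetical helper, the loss values are made up, and it assumes a lower-is-better metric such as the evaluation loss.

```python
def first_stop_index(eval_losses, patience):
    """Index of the evaluation at which training would stop, or None."""
    best = None
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if best is None or loss < best:
            best = loss          # improvement: remember it and reset the counter
            bad_evals = 0
        else:
            bad_evals += 1       # no improvement at this evaluation
            if bad_evals >= patience:
                return i
    return None

# Best value at index 1, followed by three evaluations without improvement
stop_long = first_stop_index([0.60, 0.52, 0.53, 0.55, 0.54], patience=3)
# Only two evaluations: the counter can never reach a patience of 3
stop_short = first_stop_index([0.51, 0.51], patience=3)
print(stop_long, stop_short)
```

Under this rule, a run that evaluates only a couple of times cannot be stopped early with a patience of 3, so the patience value is worth choosing relative to `eval_steps` and the total number of training steps.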

In [18]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3779
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1419


| Step | Training Loss | Validation Loss | Mse |
|---|---|---|---|
| 500 | 0.4979 | 0.511253 | 0.511253 |
| 1000 | 0.4956 | 0.511253 | 0.511253 |


The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 811
  Batch size = 8
Saving model checkpoint to BERTSeq/checkpoint-500
Configuration saved in BERTSeq/checkpoint-500/config.json
Model weights saved in BERTSeq/checkpoint-500/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 811
  Batch size = 8
Saving model checkpoint to BERTSeq/checkpoint-1000
Configuration saved in BERTSeq/checkpoint-1000/config.json
Model weights saved in BERTSeq/checkpoint-1000/pytorch_model.b

TrainOutput(global_step=1419, training_loss=0.49698239434681113, metrics={'train_runtime': 217.9194, 'train_samples_per_second': 52.024, 'train_steps_per_second': 6.512, 'total_flos': 746785732783104.0, 'train_loss': 0.49698239434681113, 'epoch': 3.0})
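
The step counts in the log follow from the dataset size and the batch size. The following is a quick check of the arithmetic. It assumes a single device and no gradient accumulation, as the log reports, and that the last partial batch of each epoch is kept.

```python
import math

num_examples = 3779      # "Num examples" in the training log
batch_size = 8           # per_device_train_batch_size
num_epochs = 3
eval_steps = 500

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
eval_points = list(range(eval_steps, total_steps + 1, eval_steps))

# Position of global step 1000 in the run:
# (completed epochs, batches into the following epoch)
epochs_done, batches_into_epoch = divmod(1000, steps_per_epoch)

print(steps_per_epoch, total_steps, eval_points, epochs_done, batches_into_epoch)
```

This gives 473 steps per epoch and 1419 steps in total, matching the log, with evaluations and checkpoints at steps 500 and 1000. Step 1000 falls 54 batches into the third epoch.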

In [18]:
trainer.train()

Loading model from BERTSeq/checkpoint-1000).
The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3779
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1419
  Continuing training from checkpoint, will skip to saved global_step
  Continuing training from epoch 2
  Continuing training from global step 1000
  Will skip the first 2 epochs then the first 54 batches in the first epoch. If this takes a lot of time, you can add the `--ignore_data_skip` flag to your launch command, but you will resume the training on data already seen by your model.



Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: s-ryanlee (use `wandb login --relogin` to force relogin)






Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from BERTSeq\checkpoint-500 (score: 0.5560224056243896).


TrainOutput(global_step=1419, training_loss=0.15732230898189747, metrics={'train_runtime': 6608.0898, 'train_samples_per_second': 1.716, 'train_steps_per_second': 0.215, 'total_flos': 746785732783104.0, 'train_loss': 0.15732230898189747, 'epoch': 3.0})

In [19]:
trainer.save_model('simple_best_model')

Saving model checkpoint to simple_best_model
Configuration saved in simple_best_model/config.json
Model weights saved in simple_best_model/pytorch_model.bin


In [20]:
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 811
  Batch size = 8


{'eval_loss': 0.5112533569335938,
 'eval_mse': 0.5112533569335938,
 'eval_runtime': 4.2314,
 'eval_samples_per_second': 191.664,
 'eval_steps_per_second': 24.106,
 'epoch': 3.0}

## Predictions

In [21]:
outputs = trainer.predict(tokenized_test_dataset)
y_pred = outputs.predictions

The following columns in the test set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: question_id, text. If question_id, text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 810
  Batch size = 8


In [22]:
merged_test_data['predicted'] = y_pred.reshape(-1)  # flatten the (n, 1) regression outputs to one value per row
merged_test_data.drop(columns=['text'], inplace=True)
merged_test_data.rename(columns={'question_id':'id'}, inplace=True)
merged_test_data.to_csv('data/submission.csv', index=False)