# Approach

This notebook works toward evaluating the model. A public notebook from the last competition, which uses a combined submission, also produces a binary output and scores 0.97. We'll take its probability scores and weight them per comment type to generate a final regression score. The BERTweet model gets preprocessed inputs (all preprocessing except the URL and Twitter-handle steps), and a regression head on top of the model is fine-tuned.

BERTweet-large uses the RoBERTa architecture, so we can easily squeeze 5 checkpoints into 20GB.
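
The weighted combination of per-type probabilities described above can be sketched as follows. This is a minimal illustration, not the source notebook's actual computation: the column names, the probability values, and the weights are all hypothetical placeholders, since the real ones live in the data-prep notebook.

```python
import pandas as pd

# Hypothetical per-type probabilities, as a multi-label classifier might emit them.
probs = pd.DataFrame({
    "toxic": [0.9, 0.1],
    "obscene": [0.5, 0.0],
    "threat": [0.2, 0.0],
})

# Hypothetical weight per comment type (assumed values, not the notebook's).
weights = pd.Series({"toxic": 1.0, "obscene": 0.5, "threat": 2.0})

def weighted_score(probs: pd.DataFrame, weights: pd.Series) -> pd.Series:
    """Return one score per row: the weighted sum of its class probabilities."""
    return probs[weights.index].mul(weights, axis=1).sum(axis=1)

scores = weighted_score(probs, weights)
```

The resulting series plays the role of the `new_wscore` column used as the training label; the scores are not bounded to [0, 1] unless the weights are chosen that way.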

In [None]:
N_EPOCHS = 1

mname = "vinai/bertweet-large"
CFG = {"lr": 1e-5, "batch_size": 4,
       "grad_acc_steps": 8, "fp16": False,
       "eval_steps": 250, "wandb_logging_steps": 30}

run_id = "bertweet_lg_better_baseline"
kaggle_source = "jdoesv/baseline-model-bertweet-regression, V2"

## Data prep
- Training data: weighted scores from the prediction notebook mentioned above
- Validation data: Jigsaw validation data, min-max scaled
- Both have `sentence` and `label` columns as inputs
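
Min-max scaling maps the smallest value to 0 and the largest to 1. A small sketch of the idea, with a hypothetical helper name (`min_max_scale`) and toy values; the guard for a constant column is an added assumption, not something the notebook does:

```python
import pandas as pd

def min_max_scale(s: pd.Series) -> pd.Series:
    """Scale a numeric series linearly into the range [0, 1]."""
    lo, hi = s.min(), s.max()
    if hi == lo:
        # A constant column has no spread; return zeros instead of dividing by 0.
        return pd.Series(0.0, index=s.index)
    return (s - lo) / (hi - lo)

ranks = pd.Series([3.0, 7.0, 11.0])  # toy stand-in for the reg_rank column
scaled = min_max_scale(ranks)
```

Scaling the validation labels this way puts them on a 0 to 1 range, so an MSE computed against them stays easy to interpret.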

In [None]:
import pandas as pd

# Sampled in source NB. TODO: Remove that and use weighted DL here.
traindf = pd.read_csv("../input/baseline-data-prep/cleaned_train_actual.csv")
# Assign back rather than calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about.
traindf["normed_text"] = traindf["normed_text"].fillna("")
traindf = traindf.sample(frac=1).reset_index(drop=True)  # Shuffle rows
traindf = traindf[["normed_text", "new_wscore"]]  # Now we need to sample
traindf = traindf.rename(columns={"normed_text": "sentence", "new_wscore": "label"})
display(traindf["label"].describe())

validdf = pd.read_csv("../input/wranged-validation-data/eval_regression.csv")
validdf = validdf[["tclean", "reg_rank"]].copy()
# Min-max scale the rank to [0, 1]
rank_min, rank_max = validdf["reg_rank"].min(), validdf["reg_rank"].max()
validdf["reg_rank"] = (validdf["reg_rank"] - rank_min) / (rank_max - rank_min)
validdf = validdf.rename(columns={"tclean": "sentence", "reg_rank": "label"})
display(validdf["label"].describe())

## Training prep

Pitfalls
1. Token length > 512. For now, truncate.
2. Non-English data. For now, ignore.
3. The regression objective does not teach ranking directly. TODO: custom loss function. Take 10 pivot elements (1 per decimal range), compute a contrastive loss against each pivot, and aggregate the results into the loss. This should teach the model a measure of relative toxicity. After only 50 steps the model is already at 0.03 MSE loss (≈ 0.17 RMSE). For now, ignore.
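
One possible reading of the pivot idea in item 3 is a hinge-style ranking loss: each sample is compared with every pivot, and it is penalized when its predicted score falls on the wrong side of the pivot's predicted score. The sketch below uses NumPy only to show the arithmetic. The function name, the margin, and the choice of hinge loss are assumptions, not a design this notebook settles on.

```python
import numpy as np

def pivot_ranking_loss(pred, label, pivot_pred, pivot_label, margin=0.05):
    """Mean hinge loss of one sample against every pivot (hypothetical helper)."""
    pivot_pred = np.asarray(pivot_pred, dtype=float)
    pivot_label = np.asarray(pivot_label, dtype=float)
    # +1 where the sample should score above the pivot, -1 below, 0 for a tie.
    direction = np.sign(label - pivot_label)
    # Positive gap means the predicted ordering agrees with the label ordering.
    gap = direction * (pred - pivot_pred)
    # Ties (direction == 0) contribute no loss.
    per_pivot = np.maximum(0.0, margin - gap) * np.abs(direction)
    return float(per_pivot.mean())

# Ten pivots, one per decimal range; pretend the model scores them perfectly.
pivot_label = np.linspace(0.05, 0.95, 10)
pivot_pred = pivot_label.copy()

good = pivot_ranking_loss(0.5, 0.5, pivot_pred, pivot_label)  # correctly placed
bad = pivot_ranking_loss(0.9, 0.5, pivot_pred, pivot_label)   # scored too high
```

A sample placed correctly among the pivots gets a loss near zero, while one that overshoots the pivots it should rank below gets a larger loss. A trainable version would need the same computation expressed in the training framework's tensor operations so that gradients can flow.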

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSequenceClassification.from_pretrained(mname, num_labels=1)

In [None]:
from datasets import Dataset

train_ds = Dataset.from_pandas(traindf)
valid_ds = Dataset.from_pandas(validdf)

key1, key2 = "sentence", None  # key2 is only set for sentence-pair inputs

def preprocess_function(examples):
    # We allow truncation, so the encoded ds and the actual ds should share the same size.
    # padding=True pads to the longest sequence within each map() batch.
    if key2 is None:
        return tokenizer(examples[key1], padding=True, truncation=True, max_length=512)
    return tokenizer(examples[key1], examples[key2], padding=True, truncation=True, max_length=512)

train_ds = train_ds.map(preprocess_function, batched=True)
valid_ds = valid_ds.map(preprocess_function, batched=True)

In [None]:
from kaggle_secrets import UserSecretsClient
import wandb
user_secrets = UserSecretsClient()
wandbkey = user_secrets.get_secret("wandbkey")
wandb.login(key=wandbkey)

In [None]:
wandb_kwargs = {
    "project": "kaggle_jigsaw",
    "tags": ["bertweet-lg", "baseline"],
    "name": run_id,
    "reinit": True,
    "notes": kaggle_source,
}

In [None]:
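from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
# Tokenization pads per map() batch, so rows can differ in length across batches.
# Re-pad each training batch to a common length instead of stacking rows as-is.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)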

run = wandb.init(**wandb_kwargs)

args = TrainingArguments(
    run_id,
    evaluation_strategy="steps",  # named eval_strategy in newer transformers releases
    eval_steps=CFG["eval_steps"],
    save_strategy = "steps",
    save_steps=CFG["eval_steps"],
    learning_rate=CFG["lr"],
    per_device_train_batch_size=CFG["batch_size"],
    gradient_accumulation_steps=CFG["grad_acc_steps"],
    per_device_eval_batch_size=64,
    num_train_epochs=N_EPOCHS,
    weight_decay=0.01,
    warmup_ratio=0.1,
    fp16=CFG["fp16"],  # wire up the otherwise unused CFG entry
    report_to=["wandb"],
    run_name=run_id,
    logging_steps=CFG["wandb_logging_steps"],
    #seed=4142
)

trainer = Trainer(
    args=args,
    model=model,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()
run.finish() 