## Minimalistic example to solve CommonLit with HuggingFace transformers and datasets

This is the simplest notebook I could write to fine-tune a pretrained transformer model on the CommonLit data and run inference with it. It uses HuggingFace transformers with its Trainer, and HuggingFace datasets to preprocess the data. I created an offline package for HF datasets so that it can be installed at inference time, when the notebook has no internet access.

### Please upvote if you find this helpful :) 

In [None]:
!pip uninstall fsspec -qq -y
!pip install --no-index --find-links ../input/hf-datasets/wheels datasets -qq

In [None]:
import pandas as pd
from datasets import Dataset
from sklearn.metrics import mean_squared_error
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

In [None]:
# disable W&B logging as we don't have access to the internet
%env WANDB_DISABLED=True
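
The `%env` magic only exists inside a notebook. If the same code is run as a plain script, the variable can be set from Python instead. A minimal sketch, assuming it runs before the `Trainer` is created:

```python
import os

# Equivalent of `%env WANDB_DISABLED=True`: the value is stored as the string "True".
os.environ["WANDB_DISABLED"] = "True"

print(os.environ["WANDB_DISABLED"])
```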

## Config

In [None]:
model_checkpoint = '../input/distillbert-huggingface-model'
batch_size = 16
max_length = 256

## Loading and preprocessing training data with HF datasets

In [None]:
df = pd.read_csv('../input/step-1-create-folds/train_folds.csv') # https://www.kaggle.com/abhishek/step-1-create-folds
df = df.rename(columns={'target':'label'}) # HF expects this column name to pick up the target column in trainer

train_dataset = Dataset.from_pandas(df[df.kfold != 0].reset_index(drop=True))
valid_dataset = Dataset.from_pandas(df[df.kfold == 0].reset_index(drop=True))

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

def tokenize(batch):
    # pad or truncate every excerpt to exactly max_length tokens
    return tokenizer(batch['excerpt'], padding='max_length', truncation=True, max_length=max_length)

train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
valid_dataset = valid_dataset.map(tokenize, batched=True, batch_size=len(valid_dataset))
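
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a simplified sketch in plain Python. It is not the tokenizer's actual implementation: `pad_or_truncate`, `PAD_ID` and the token ids are made up for illustration, and a real tokenizer also keeps the closing special token when it truncates.

```python
from typing import Dict, List

PAD_ID = 0  # assumption: padding token id; the real value comes from the tokenizer


def pad_or_truncate(token_ids: List[int], max_length: int) -> Dict[str, List[int]]:
    """Clip a sequence to max_length, then right-pad it to exactly max_length."""
    clipped = token_ids[:max_length]
    n_pad = max_length - len(clipped)
    return {
        "input_ids": clipped + [PAD_ID] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(clipped) + [0] * n_pad,
    }


# Made-up token ids, for illustration only
short = pad_or_truncate([101, 7592, 102], max_length=5)
long = pad_or_truncate(list(range(1, 9)), max_length=5)
print(short)
print(long)
```

Every example ends up with the same length, which is why the notebook can batch the tokenized datasets without a custom collator.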

## Model and Training with HF transformers

In [None]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=1) # note this is actually a regression model

def compute_metrics(pred):
    return {
        'rmse': mean_squared_error(pred.label_ids, pred.predictions, squared=False),
    }

# hyperparameter tuning in this notebook: https://www.kaggle.com/thedrcat/commonlit-hf-trainer-hyperparameter-tuning/
args = TrainingArguments(
    "./tmp",
    evaluation_strategy="epoch",
    learning_rate=9.734456575183276e-05,
    fp16=True,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    seed=2,
    weight_decay=0.006786875788460002,
    load_best_model_at_end=True,
)

trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()
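
The `rmse` value reported after each epoch is the square root of the mean squared error between targets and predictions. A minimal sketch of that calculation in plain Python, with made-up numbers and a hypothetical helper named `rmse`:

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up targets and predictions, for illustration only
targets = [0.0, -1.0, -2.0, 1.0]
predictions = [0.5, -1.5, -1.5, 0.5]
score = rmse(targets, predictions)
print(score)
```

Each prediction here is off by 0.5, so the score is 0.5: RMSE is expressed in the same units as the target.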

## Test Inference

In [None]:
test_df = pd.read_csv('../input/commonlitreadabilityprize/test.csv')
test_df = test_df.rename(columns={'target':'label'}) # no-op if test.csv has no 'target' column; kept for symmetry with training

test_dataset = Dataset.from_pandas(test_df)
test_dataset = test_dataset.map(tokenize, batched=True, batch_size=len(test_dataset))

test_preds = trainer.predict(test_dataset)

## Submission

In [None]:
sub = pd.read_csv('../input/commonlitreadabilityprize/sample_submission.csv')
sub.target = test_preds.predictions.ravel() # flatten the model outputs to one value per row
sub.to_csv('submission.csv', index=False)
sub.head()
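
With `num_labels=1` the model emits one value per excerpt, so the predictions are expected to arrive as one single-element row per example. A small sketch of flattening them and checking that there is exactly one prediction per submission row; `flatten_predictions`, the ids and the values are hypothetical:

```python
from typing import List, Sequence


def flatten_predictions(rows: Sequence[Sequence[float]]) -> List[float]:
    """Turn [[p0], [p1], ...] into [p0, p1, ...], rejecting rows that do not hold exactly one value."""
    flat = []
    for row in rows:
        if len(row) != 1:
            raise ValueError(f"expected one value per row, got {len(row)}")
        flat.append(row[0])
    return flat


# Made-up ids and raw predictions, for illustration only
ids = ["a", "b", "c"]
raw_predictions = [[-0.3], [1.2], [-2.0]]

flat_targets = flatten_predictions(raw_predictions)
assert len(flat_targets) == len(ids), "one prediction per submission row"
print(list(zip(ids, flat_targets)))
```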
