Hi! For fun, and to refresh my memory of some **Hugging Face** functionality, I created a notebook in which I fine-tune the DistilBERT model on **data from the "Learning Agency Lab - Automated Essay Scoring 2.0"** competition.

The target ("y") in this competition is a score from 1 to 6, but I am building a model that, based on the full text of the essay, **predicts whether the student who wrote it will receive a score greater than or equal to 4 (4, 5, or 6) or less than 4 (1, 2, or 3)**.
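As a quick illustration of the binarization described above, here is a minimal sketch in plain Python. The function name `to_binary_label` and the `threshold` parameter are hypothetical and are not part of the notebook's code.

```python
def to_binary_label(score: int, threshold: int = 4) -> int:
    """Return 1 for a high score (>= threshold), otherwise 0."""
    return 1 if score >= threshold else 0


# Every possible competition score mapped to its binary label.
scores = [1, 2, 3, 4, 5, 6]
labels = [to_binary_label(s) for s in scores]
print(dict(zip(scores, labels)))
```

Scores 1 to 3 land in class 0 and scores 4 to 6 land in class 1, so the model never sees the distinction between, say, a 4 and a 6.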

The **model's prediction can be used as an explanatory feature** (the probability of label "1", i.e. a high score), but because binarizing the target "flattens" the score information, it does not seem to have much predictive power.
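A sequence classification model returns raw logits rather than probabilities, so getting the probability of label "1" takes a softmax over the two logits. The sketch below shows that step in plain Python; the helper name `positive_class_probability` and the example logits are made up for illustration.

```python
import math


def positive_class_probability(logits):
    """Softmax over two logits; returns the probability of label 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


# Hypothetical logits for three essays, ordered as [label 0, label 1].
example_logits = [[2.0, -1.0], [0.0, 0.0], [-0.5, 1.5]]
probs = [positive_class_probability(row) for row in example_logits]
print(probs)
```

With the NumPy array of logits that a prediction call returns, the same computation would normally be vectorized across rows instead of looping.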

Have fun!

In [None]:
DEBUG = False

In [None]:
! rm -rf /opt/conda/lib/python3.10/site-packages/aiohttp-3.9.1.dist-info
!pip install evaluate

import pandas as pd
import numpy as np
from datasets import Dataset, DatasetDict  # load_metric is unused here; the evaluate library replaces it
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
from sklearn.metrics import classification_report
import evaluate

In [None]:
!pip install huggingface_hub
!python -c "from huggingface_hub.hf_api import HfFolder; HfFolder.save_token('')"

In [None]:
essays = pd.read_csv('/kaggle/input/learning-agency-lab-automated-essay-scoring-2/train.csv')
essays.head()

In [None]:
conditions = [
    (essays['score'] >= 4),
    (essays['score'] < 4),
]

choices = [1, 0]


essays['label'] = np.select(conditions, choices, default = np.nan)

In [None]:
essays = essays[essays['label'].notnull()]
essays = essays[['full_text', 'label']]
essays = essays.rename(columns={"full_text": "text"})
essays['label'] = essays['label'].astype('int')

In [None]:
essays

In [None]:
essays['label'].value_counts()

In [None]:
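# Keep the three splits disjoint: the last 7400 rows form the test set and
# the 407 rows just before them form the validation set.
train = essays.head(8500)
valid = essays.iloc[-7807:-7400]
test = essays.tail(7400)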
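A head/tail split depends on the row order of the file, and slices taken with `tail` can overlap if their sizes are not chosen carefully. As an alternative, here is a minimal sketch of a seeded, shuffled split into three non-overlapping parts. The function `split_indices`, its parameters and the sizes used are hypothetical, not what the notebook does.

```python
import random


def split_indices(n_rows, n_train, n_valid, seed=42):
    """Shuffle row positions once, then cut three non-overlapping parts."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    train_idx = positions[:n_train]
    valid_idx = positions[n_train:n_train + n_valid]
    test_idx = positions[n_train + n_valid:]
    return train_idx, valid_idx, test_idx


# Toy sizes; the remaining 40 positions become the test part.
train_idx, valid_idx, test_idx = split_indices(n_rows=100, n_train=50, n_valid=10)
print(len(train_idx), len(valid_idx), len(test_idx))
```

With a pandas DataFrame, the resulting position lists could be passed to `essays.iloc[...]` to select the rows of each part.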

In [None]:
print("Train rows: ", train.shape[0])
print("Valid rows: ", valid.shape[0])
print("Test rows: ", test.shape[0])

In [None]:
if DEBUG:
    train = train.head(150)
    valid = valid.head(50)
    test = test.head(100)

In [None]:
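# The splits are pandas DataFrames, so use from_pandas; preserve_index=False
# keeps the old row index from being added as an extra column.
train = Dataset.from_pandas(train, preserve_index=False)
valid = Dataset.from_pandas(valid, preserve_index=False)
test = Dataset.from_pandas(test, preserve_index=False)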
dataset = DatasetDict({"train": train, "valid": valid, "test": test})

In [None]:
model_name = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
def tokenize(examples):
    outputs = tokenizer(examples['text'], truncation=True)
    return outputs

tokenized_dataset = dataset.map(tokenize, batched=True)

In [None]:
tokenized_dataset

In [None]:
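# Load the metric once; the GLUE "mrpc" configuration reports accuracy and F1.
metric = evaluate.load("glue", "mrpc")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    # The predicted class is the index of the largest logit.
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)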
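The GLUE "mrpc" metric reports accuracy and F1 for the positive class. To make those two numbers concrete, here is a hand-rolled sketch in plain Python; the function `accuracy_and_f1` and the toy predictions are hypothetical and only mirror the definitions, not the library's implementation.

```python
def accuracy_and_f1(predictions, references):
    """Accuracy plus F1 for the positive class (label 1)."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)  # true positives
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)  # false positives
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)  # false negatives
    correct = sum(1 for p, r in pairs if p == r)
    denom = 2 * tp + fp + fn
    return {
        "accuracy": correct / len(pairs),
        "f1": 2 * tp / denom if denom else 0.0,
    }


toy = accuracy_and_f1(predictions=[1, 0, 1, 1, 0], references=[1, 0, 0, 1, 1])
print(toy)
```

In the toy example there are 2 true positives, 1 false positive and 1 false negative, so F1 is 4 / 6, while 3 of 5 predictions are correct.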

In [None]:
training_args = TrainingArguments(num_train_epochs=1,
                                  output_dir="distilbert-food",
                                  push_to_hub=True,
                                  per_device_train_batch_size=8,
                                  per_device_eval_batch_size=8,
                                  evaluation_strategy="epoch",
                                  weight_decay=5e-4)
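For a rough sense of how long one epoch is with these settings, the number of optimization steps can be estimated from the row count and the batch size. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept; the helper `steps_per_epoch` is hypothetical.

```python
import math


def steps_per_epoch(n_rows, batch_size):
    """Batches per epoch when the final partial batch is kept."""
    return math.ceil(n_rows / batch_size)


full_run = steps_per_epoch(n_rows=8500, batch_size=8)   # full training split
debug_run = steps_per_epoch(n_rows=150, batch_size=8)   # DEBUG = True split
print(full_run, debug_run)
```

On multiple devices the effective batch size grows with the device count, so the step count would shrink accordingly.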

In [None]:
data_collator = DataCollatorWithPadding(tokenizer)

In [None]:
trainer = Trainer(model = model, tokenizer = tokenizer,
                  data_collator = data_collator,
                  args = training_args,
                  train_dataset = tokenized_dataset["train"],
                  eval_dataset = tokenized_dataset["valid"], 
                  compute_metrics = compute_metrics)

In [None]:
trainer.train()

In [None]:
# Make predictions on the test dataset
y_pred = trainer.predict(tokenized_dataset["test"]).predictions
y_pred = np.argmax(y_pred, axis=-1)

y_true = tokenized_dataset["test"]["label"]
y_true = np.array(y_true)

# Print the classification report
print(classification_report(y_true, y_pred, digits=4))

In [None]:
model_id = "Michau96" + "/" + model_name + "_essay_scoring_kaggle" 

In [None]:
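# Upload the fine-tuned weights to the repository named by model_id.
model.push_to_hub(model_id)
# Trainer.push_to_hub takes a commit message as its first argument, not a
# repository id; it uploads to the repository derived from output_dir.
trainer.push_to_hub()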

In [None]:
trainer.save_model('distilbert_essay_scoring_kaggle')

**Thanks for reading my notebook!**

**If you have any questions or suggestions for improving the analysis, let me know in the comments!**

**If you appreciate my work in this notebook, give it an upvote!**

**If you have a moment, I recommend looking at my other [projects](https://www.kaggle.com/michau96/code).**