<a href="https://colab.research.google.com/github/simulate111/Introduction-to-Human-Language-Technology/blob/main/course_project_Reza2024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to HLT Project (Template)

- Student(s) Name(s): Mohammadreza Akhtari
- Date: April 2024
- Chosen Corpus: IMDb (Large Movie Review Dataset, `stanfordnlp/imdb`)
- Contributions (if group project): -

### Corpus information

- Description of the chosen corpus: The Large Movie Review Dataset is a corpus for binary sentiment classification. It contains 25,000 highly polar movie reviews for training and 25,000 for testing, plus 50,000 unlabeled reviews.
- Paper(s) and other published materials related to the corpus: Maas, Andrew, et al. "Learning Word Vectors for Sentiment Analysis." Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 2011.
- State-of-the-art performance (best published results) on this corpus:

---

## 1. Setup

In [39]:
# Your code to install and import libraries etc. here
!pip install --quiet datasets transformers[torch] evaluate optuna plotly
import datasets
from pprint import pprint #pprint => pretty-print


---

## 2. Data download and preprocessing

### 2.1. Download the corpus

In [13]:
# Your code to download the corpus here
dataset = datasets.load_dataset("stanfordnlp/imdb")
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [37]:
dataset['train'][0:2]

{'text': ["Before watching this film I had very low expectations and went to just see the cars. Eventually I even regretted going for that reason. Plot is almost non-existent. Character development is non-existent. So many clichés and so much jaw-dropping cheesiness existed in the movie that I could only stare and wonder how it was even released. If not for the exotics, I wouldn't have even rated this movie a 1. An attempt at a coherent story line is destroyed by the sheer absurdity of this elite racing cult and the laughable characters that make up its members. In fact, the movie's plot is so predictable and simple-minded that an average child could foretell the majority of the storyline. Bad acting, bad plot, bad jokes, bad movie.<br /><br />Don't see it. Play Gran Turismo HD instead and it'll satiate your thirst for fast sexy cars without leaving a bad aftertaste.",
  "I had never heard of this flick despite the connection to George Clooney (whose company produced and he appears in 

### 2.2. Preprocessing

In [15]:
# Your code for any necessary preprocessing here
#This is never a bad idea, datasets may have ordering to them, which is not what we want
dataset=dataset.shuffle()
#Delete the unlabeled part of the dataset, we don't need it for anything
del dataset["unsupervised"]

In [18]:
import sklearn.feature_extraction
vectorizer=sklearn.feature_extraction.text.CountVectorizer(binary=True, max_features=25000)

#get a list of all texts from the training data
texts=[ex["text"] for ex in dataset["train"]]

#"Trains" the vectorizer, i.e. builds its vocabulary
vectorizer.fit(texts)

In [21]:
def vectorize_example(ex):
    #because the vectorizer expects a list/iterable over inputs, not one input
    vectorized=vectorizer.transform([ex["text"]])
    #.nonzero gives a pair of (rows,columns), we want the columns
    non_zero_features=vectorized.nonzero()[1]
    #feature index 0 will have a special meaning
    # so let us produce it by adding +1 to everything
    non_zero_features+=1
    return {"input_ids":non_zero_features}

vectorized=vectorize_example(dataset["train"][0])
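The binary bag-of-words encoding with the +1 shift can be illustrated without scikit-learn. This is a minimal sketch, not the library's implementation: `build_vocabulary`, `to_input_ids`, and the toy texts are hypothetical, and the regular expression only approximates the default tokenization of `CountVectorizer`.

```python
import re

TOKEN_PATTERN = r"\b\w\w+\b"  # tokens of two or more word characters

def build_vocabulary(texts):
    """Map every distinct lowercase token to an index, in alphabetical order."""
    tokens = set()
    for text in texts:
        tokens.update(re.findall(TOKEN_PATTERN, text.lower()))
    return {word: idx for idx, word in enumerate(sorted(tokens))}

def to_input_ids(text, vocabulary):
    """Return the sorted indices of the known words present, shifted by +1.

    Repeated words count once (binary features); index 0 stays free for padding.
    """
    present = {
        vocabulary[token]
        for token in re.findall(TOKEN_PATTERN, text.lower())
        if token in vocabulary
    }
    return [idx + 1 for idx in sorted(present)]

toy_texts = ["bad plot, bad acting", "great acting"]
toy_vocab = build_vocabulary(toy_texts)
toy_ids = to_input_ids("bad bad acting", toy_vocab)
print(toy_vocab)
print(toy_ids)
```

Word order and repetition are discarded, which is why the representation is a set of feature indices rather than a sequence.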

In [22]:
# We can map back to vocabulary and check that everything works
# vectorizer.vocabulary_ is a dictionary {key:word, value:idx}

idx2word=dict((i,w) for (w,i) in vectorizer.vocabulary_.items()) #inverse the vocab dictionary
words=[]
for idx in vectorized["input_ids"]:
    ## It is easy to forget that we shifted all indices by +1
    words.append(idx2word[idx-1])
#This is now the bag of words representation of the document
pprint(", ".join(words))

('absurdity, acting, aftertaste, almost, an, and, at, attempt, average, bad, '
 'before, br, by, cars, character, characters, cheesiness, child, clichés, '
 'coherent, could, cult, destroyed, development, don, dropping, elite, even, '
 'eventually, existed, existent, expectations, fact, fast, film, for, going, '
 'had, have, hd, how, if, in, instead, is, it, its, jaw, jokes, just, '
 'laughable, leaving, line, ll, low, majority, make, many, members, minded, '
 'movie, much, non, not, of, only, play, plot, predictable, racing, rated, '
 'reason, regretted, released, see, sexy, sheer, simple, so, stare, story, '
 'storyline, that, the, thirst, this, to, up, very, was, watching, went, '
 'without, wonder, wouldn, your')


In [38]:
# Apply the vectorizer to the whole dataset using .map()
dataset_tokenized = dataset.map(vectorize_example,num_proc=4)
pprint(dataset_tokenized["train"][0])

{'input_ids': [374,
               484,
               698,
               905,
               1035,
               1066,
               1566,
               1607,
               1722,
               1830,
               2157,
               2840,
               3258,
               3529,
               3779,
               3789,
               3879,
               3934,
               4215,
               4358,
               5123,
               5455,
               6121,
               6177,
               6695,
               6913,
               7295,
               7771,
               7779,
               7918,
               7920,
               7941,
               8099,
               8227,
               8451,
               8767,
               9567,
               10018,
               10262,
               10294,
               10824,
               11010,
               11216,
               11565,
               11851,
               11881,
               11889,
       

In [40]:
import torch

def collator(list_of_examples):
    batch={"labels":torch.tensor(list(ex["label"] for ex in list_of_examples))} #this is easy, labels are made into a single tensor
    #the worse bit is now to pad the examples, as they are of different length
    tensors=[]
    max_len=max(len(example["input_ids"]) for example in list_of_examples) #this is the longest example in the batch
    #everything needs to be padded to fit in length the longest example
    #(so we can build a single tensor out of it)
    for example in list_of_examples:
        ids=torch.tensor(example["input_ids"]) #pick the input ids
        # pad(what,(from_left, from_right)) <- this is how we call the stock pad function
        padded=torch.nn.functional.pad(ids,(0,max_len-ids.shape[0])) #pad by max - current length, pads with zero by default
        tensors.append(padded) #accumulate the padded ids
    batch["input_ids"]=torch.vstack(tensors) #now that we have all of them the same length, a simple vstack() stacks them up
    return batch #...and that's all there is to it
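The padding step of the collator can be shown on plain lists. A minimal sketch, assuming right-padding with zeros as in the collator above; `pad_batch` and the toy input are hypothetical and no tensors are involved.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence with pad_value to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]

toy_batch = pad_batch([[3, 7, 9], [5], [2, 4]])
for row in toy_batch:
    print(row)
```

Because real feature indices start at 1 after the +1 shift, a padded 0 can never be confused with a word.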


---

## 3. Machine learning model

### 3.1. Model training

In [41]:
# Your code to train the machine learning model on the training set and evaluate the performance on the validation set here
import torch
import transformers

# A model wants a config, I can simply inherit from the base
# class for pretrained configs
class MLPConfig(transformers.PretrainedConfig):
    pass

# This is the model
class MLP(transformers.PreTrainedModel):

    config_class=MLPConfig

    # In the initialization method, one instantiates the layers
    # these will be, for the most part the trained parameters of the model
    def __init__(self,config):
        super().__init__(config)
        self.vocab_size=config.vocab_size #embedding matrix row count
        # Build and initialize embedding of vocab size +1 x hidden size (+1 because of the padding index 0!)
        self.embedding=torch.nn.Embedding(num_embeddings=self.vocab_size+1,embedding_dim=config.hidden_size,padding_idx=0)
        # Normally you would not initialize these yourself, but I have my reasons here ;)
        torch.nn.init.uniform_(self.embedding.weight.data,-0.001,0.001) #initialize the embeddings with small random values
        # Note! uniform_() overwrites every row, including the padding row 0, so zero that row again;
        # padding_idx=0 then keeps it fixed at zero, because it receives no gradient updates
        self.embedding.weight.data[0].zero_()
        # This takes care of the lower half of the network, now the upper half
        # Output layer: hidden size x output size
        self.output=torch.nn.Linear(in_features=config.hidden_size,out_features=config.nlabels)
        # Now we have the parameters of the model


    # The computation of the model is put into the forward() function
    # it receives a batch of data and optionally the correct `labels`
    #
    # If given `labels` it returns (loss,output)
    # if not, then it returns (output,)
    def forward(self,input_ids,labels=None): #the custom collator supplies exactly these two keys
        #1) sum up the embeddings of the items
        embedded=self.embedding(input_ids) #(batch,ids)->(batch,ids,embedding_dim)
        # Since the Embedding keeps the first row of the matrix pure zeros, we don't need to worry about the padding
        # so next we sum the embeddings across the word dimension
        # (batch,ids,embedding_dim) -> (batch,embedding_dim)
        embedded_summed=torch.sum(embedded,dim=1)

        #2) apply non-linearity
        # (batch,embedding_dim) -> (batch,embedding_dim)

        #### MODIFIED HERE FOR EXERCISE 5 -> commented out
        ####projected=torch.tanh(embedded_summed) #Note how non-linearity is applied here and not when configuring the layer in __init__()

        #3) and now apply the upper, output layer of the network
        # (batch,embedding_dim) -> (batch, num_of_classes i.e. 2 in our case)

        #### MODIFIED HERE FOR EXERCISE 5 -> base it off embedded_summed
        ##### OLD: logits=self.output(projected)
        logits=self.output(embedded_summed)

        # ...and that's all there is to it!

        #print("input_ids.shape",input_ids.shape)
        #print("embedded.shape",embedded.shape)
        #print("embedded_summed.shape",embedded_summed.shape)
        #print("projected.shape",projected.shape)
        #print("logits.shape",logits.shape)

        # We have labels, so we ought to calculate the loss
        if labels is not None:
            loss=torch.nn.CrossEntropyLoss() #This loss is meant for classification, so let's use it
            # You run it as loss(model_output,correct_labels)
            return (loss(logits,labels),logits)
        else:
            # No labels, so just return the logits
            return (logits,)

# Configure the model:
#   these parameters are used in the model's __init__()


####### CHANGED HERE FOR EXERCISE 5, hidden_size from 20 to 1
mlp_config=MLPConfig(vocab_size=len(vectorizer.vocabulary_),hidden_size=1,nlabels=2)
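With `hidden_size=1` and the `tanh` removed, every word contributes a single learned number, and the output layer applies only a weight and a bias per class. The model is then a linear classifier over the bag of words. A small sketch of that arithmetic, where `logits_from_scalars` and all numeric values are hypothetical and chosen only for illustration:

```python
def logits_from_scalars(word_scalars, out_weight, out_bias):
    """Forward pass for hidden_size=1: sum the word scalars, then one affine map per class."""
    total = sum(word_scalars)
    return [w * total + b for w, b in zip(out_weight, out_bias)]

scalars = [0.5, -0.25, 1.0]  # one embedding value per word in the review
weight = [-2.0, 2.0]         # output weights for class 0 and class 1
bias = [0.25, -0.25]         # output biases for class 0 and class 1

logits = logits_from_scalars(scalars, weight, bias)
margin = logits[1] - logits[0]

# The same margin, computed as a sum of independent per-word contributions
per_word = [(weight[1] - weight[0]) * s for s in scalars]
margin_from_words = sum(per_word) + (bias[1] - bias[0])

print(logits, margin, per_word, margin_from_words)
```

Each entry of `per_word` can be read as how strongly that word pushes the prediction toward class 1, which makes the reduced model easy to inspect.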


In [43]:
# And we can make a model
mlp=MLP(mlp_config)
fake_batch=collator([dataset_tokenized["train"][0],dataset_tokenized["train"][1]])
mlp(**fake_batch) #** expands input_ids and labels as parameters of the call

(tensor(0.7372, grad_fn=<NllLossBackward0>),
 tensor([[0.7065, 0.1133],
         [0.7142, 0.1188]], grad_fn=<AddmmBackward0>))

In [44]:
# Set training arguments
# their names are mostly self-explanatory
trainer_args = transformers.TrainingArguments(
    "mlp_checkpoints", #save checkpoints here
    evaluation_strategy="steps",
    logging_strategy="steps",
    eval_steps=500,
    logging_steps=500,
    learning_rate=1e-4, #learning rate of the gradient descent
    max_steps=20000,
    load_best_model_at_end=True,
    per_device_train_batch_size=128
)

pprint(trainer_args)

TrainingArguments(
_n_gpu=0,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'gradient_accumulation_kwargs': None},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_steps=500,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla

### 3.2 Hyperparameter optimization

In [6]:
# Your code for hyperparameter optimization here


### 3.3. Evaluation on test set

In [45]:
# Your code to evaluate the final model on the test set here
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = np.argmax(outputs, axis=-1) #pick the index of the "winning" label
    return accuracy.compute(predictions=predictions, references=labels)
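The metric function picks the highest-scoring class per example and compares it with the reference label. The same computation in plain Python, as a sketch: `argmax_accuracy` and the toy logits are hypothetical, and the `evaluate` library is not used here.

```python
def argmax_accuracy(logit_rows, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logit_rows]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

toy_logits = [[0.2, 0.9], [1.5, -0.3], [0.1, 0.4], [0.8, 0.7]]
toy_labels = [1, 0, 0, 0]
toy_result = argmax_accuracy(toy_logits, toy_labels)
print(toy_result)
```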


In [47]:
# Make a new model
mlp = MLP(mlp_config)


# The argument is the patience: the number of consecutive evaluations
# in which the evaluation loss may fail to improve before training is stopped
early_stopping = transformers.EarlyStoppingCallback(early_stopping_patience=5)

trainer = transformers.Trainer(
    model=mlp,
    args=trainer_args,
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["test"].select(range(1000)), #make a smaller subset to evaluate on
    compute_metrics=compute_accuracy,
    data_collator=collator,
    callbacks=[early_stopping]
)

# FINALLY!
trainer.train()

max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.6748,0.650864,0.763
1000,0.6036,0.580271,0.828
1500,0.5217,0.514105,0.836
2000,0.4529,0.46203,0.848
2500,0.3999,0.423566,0.859
3000,0.358,0.393951,0.861
3500,0.3263,0.371326,0.867
4000,0.3003,0.354355,0.872
4500,0.2784,0.340955,0.874
5000,0.2609,0.330168,0.874


TrainOutput(global_step=13000, training_loss=0.2733456326998197, metrics={'train_runtime': 435.0373, 'train_samples_per_second': 5884.553, 'train_steps_per_second': 45.973, 'total_flos': 17285442432.0, 'train_loss': 0.2733456326998197, 'epoch': 66.3265306122449})
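The run above ended at step 13,000 rather than at `max_steps=20000`, which is consistent with the early-stopping patience being exhausted. The patience rule can be sketched in a simplified form. The function `stopping_eval` and the loss values are hypothetical; the actual callback also supports an improvement threshold and tracks whichever metric the training arguments select.

```python
def stopping_eval(losses, patience):
    """Return the 0-based index of the evaluation that triggers the stop, or None."""
    best = float("inf")
    bad_evals = 0
    for index, loss in enumerate(losses):
        if loss < best:
            best = loss      # improvement: remember it and reset the counter
            bad_evals = 0
        else:
            bad_evals += 1   # no improvement at this evaluation
            if bad_evals >= patience:
                return index
    return None

toy_losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.44, 0.45, 0.46, 0.47]
print(stopping_eval(toy_losses, patience=3))
print(stopping_eval(toy_losses, patience=5))
```

With a patience of 3 the sketch stops at the last evaluation, three evaluations after the best loss of 0.44. With a patience of 5 it never stops on this sequence.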

In [49]:
eval_results = trainer.evaluate(dataset_tokenized["test"])

print(eval_results)

{'eval_loss': 0.27793562412261963, 'eval_accuracy': 0.88948, 'eval_runtime': 21.9119, 'eval_samples_per_second': 1140.935, 'eval_steps_per_second': 142.617, 'epoch': 66.3265306122449}


In [50]:
import optuna

def objective(trial):
    # Define the search space for hyperparameters
    learning_rate = trial.suggest_float("learning_rate", 1e-6, 1e-3, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 64, 128, 256])


    trainer_args = transformers.TrainingArguments(
        "mlp_checkpoints", #save checkpoints here
        evaluation_strategy="steps",
        logging_strategy="steps",
        eval_steps=500,
        logging_steps=500,
        learning_rate=learning_rate, #learning rate of the gradient descent
        max_steps=10000, #I will keep this small, no time to wait too long :)
        load_best_model_at_end=True,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size
    )

    mlp = MLP(mlp_config)

    trainer = transformers.Trainer(
        model=mlp,
        args=trainer_args,
        train_dataset=dataset_tokenized["train"],
        eval_dataset=dataset_tokenized["test"].select(range(1000)), #make a smaller subset to evaluate on
        compute_metrics=compute_accuracy,
        data_collator=collator,
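        callbacks=[transformers.EarlyStoppingCallback(early_stopping_patience=5)] #fresh callback per trial, so no patience state is shared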
    )

    # Train the model and return the accuracy on the evaluation subset
    trainer.train()
    eval_results = trainer.evaluate()
    return eval_results["eval_accuracy"] #let's try to maximize accuracy

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=7) #I will keep this small, no time to wait :)

[I 2024-05-06 11:56:49,541] A new study created in memory with name: no-name-904d54ca-a712-4c53-9966-aa122dffb979
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.5119,0.425763,0.855
1000,0.3228,0.347142,0.873
1500,0.2477,0.317156,0.879
2000,0.205,0.305269,0.88
2500,0.1756,0.301831,0.877
3000,0.1531,0.302249,0.882
3500,0.1359,0.305305,0.883
4000,0.122,0.311477,0.875
4500,0.1093,0.316243,0.877
5000,0.1,0.322844,0.877


[I 2024-05-06 11:59:12,032] Trial 0 finished with value: 0.877 and parameters: {'learning_rate': 0.0002814569334793928, 'batch_size': 128}. Best is trial 0 with value: 0.877.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.6912,0.683589,0.55
1000,0.6796,0.674352,0.627
1500,0.6695,0.666163,0.691
2000,0.6597,0.658769,0.731
2500,0.6527,0.651901,0.754
3000,0.6447,0.645556,0.765
3500,0.6374,0.63987,0.775
4000,0.6315,0.634539,0.786
4500,0.6234,0.629706,0.792
5000,0.6188,0.625376,0.801


[I 2024-05-06 12:00:23,390] Trial 1 finished with value: 0.812 and parameters: {'learning_rate': 1.0018017519383743e-05, 'batch_size': 16}. Best is trial 0 with value: 0.877.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.5351,0.460294,0.851
1000,0.3538,0.370723,0.865
1500,0.271,0.332514,0.868
2000,0.2236,0.315012,0.877
2500,0.1905,0.306252,0.88
3000,0.1675,0.303395,0.879
3500,0.1485,0.304775,0.879
4000,0.1336,0.306913,0.878
4500,0.1211,0.310581,0.875
5000,0.1109,0.315207,0.873


[I 2024-05-06 12:04:29,339] Trial 2 finished with value: 0.879 and parameters: {'learning_rate': 0.0002203441254376891, 'batch_size': 256}. Best is trial 2 with value: 0.879.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.6682,0.627122,0.779
1000,0.5622,0.514474,0.836
1500,0.4525,0.432669,0.847
2000,0.3727,0.386552,0.86
2500,0.3304,0.354608,0.87
3000,0.3031,0.336449,0.877
3500,0.2733,0.324028,0.873
4000,0.2552,0.312926,0.873
4500,0.245,0.307644,0.873
5000,0.2216,0.3034,0.878


[I 2024-05-06 12:05:48,705] Trial 3 finished with value: 0.881 and parameters: {'learning_rate': 0.0002711357249711533, 'batch_size': 16}. Best is trial 3 with value: 0.881.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.6923,0.690756,0.509
1000,0.6895,0.688572,0.53
1500,0.6869,0.686539,0.604
2000,0.6845,0.684644,0.665
2500,0.6822,0.682877,0.713
3000,0.6802,0.681234,0.751
3500,0.6783,0.679711,0.768
4000,0.6765,0.678305,0.778
4500,0.6748,0.677015,0.782
5000,0.6734,0.675841,0.787


[I 2024-05-06 12:13:15,667] Trial 4 finished with value: 0.804 and parameters: {'learning_rate': 1.4507454347811181e-06, 'batch_size': 256}. Best is trial 3 with value: 0.881.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.4712,0.350747,0.86
1000,0.2146,0.303301,0.874
1500,0.1426,0.310203,0.876
2000,0.1032,0.333006,0.869
2500,0.0766,0.359431,0.87
3000,0.0582,0.389073,0.864
3500,0.0446,0.423939,0.859


[I 2024-05-06 12:15:53,470] Trial 5 finished with value: 0.874 and parameters: {'learning_rate': 0.000693695683466865, 'batch_size': 256}. Best is trial 3 with value: 0.881.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.688,0.683859,0.778
1000,0.6776,0.675433,0.772
1500,0.6677,0.667361,0.777
2000,0.6585,0.659689,0.784
2500,0.6498,0.652523,0.788
3000,0.6415,0.645838,0.785
3500,0.634,0.639605,0.795
4000,0.6269,0.633885,0.796
4500,0.6204,0.628696,0.797
5000,0.6148,0.623948,0.801


[I 2024-05-06 12:20:03,306] Trial 6 finished with value: 0.809 and parameters: {'learning_rate': 1.0068025147504158e-05, 'batch_size': 128}. Best is trial 3 with value: 0.881.
[I 2024-05-06 12:20:03,321] A new study created in memory with name: no-name-dafc593b-c24c-40f5-b523-7cb8890bd8a6
max_steps is given, it will override any value given in num_train_epochs
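The search above declares the learning rate with `log=True`, so candidate values are spread across orders of magnitude rather than evenly across the raw interval. A minimal sketch of a log-uniform draw using only the standard library follows. `sample_log_uniform` is a hypothetical helper and this is not Optuna's own sampler.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)  # fixed seed, only to make the sketch repeatable
samples = [sample_log_uniform(1e-6, 1e-3, rng) for _ in range(1000)]

# Count the samples in each decade: [1e-6, 1e-5), [1e-5, 1e-4), [1e-4, 1e-3)
decade_counts = [
    sum(1 for s in samples if 10.0 ** exp <= s < 10.0 ** (exp + 1))
    for exp in (-6, -5, -4)
]
print(decade_counts)
```

Each decade receives roughly a third of the draws, whereas a plain uniform draw over the same interval would place about 90% of the values above 1e-4.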


In [None]:
# Print the best trial's hyperparameters and objective value
best_trial = study.best_trial
print(f"Best trial (number {best_trial.number}):")
print(f"  Value: {best_trial.value}")
print(f"  Params: {best_trial.params}")

# Print all trials' hyperparameters and objective values
print("\nAll trials:")
for trial in study.trials:
    print(f"  Trial {trial.number}:")
    print(f"    Value: {trial.value}")
    print(f"    Params: {trial.params}")
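The seven finished trials can also be ranked directly from the log printed during the search, without re-running the study. In this sketch the values are copied from that log; the tuple layout and the variable names are assumptions made for illustration.

```python
# (trial number, accuracy, learning rate, batch size), copied from the search log
logged_trials = [
    (0, 0.877, 0.0002814569334793928, 128),
    (1, 0.812, 1.0018017519383743e-05, 16),
    (2, 0.879, 0.0002203441254376891, 256),
    (3, 0.881, 0.0002711357249711533, 16),
    (4, 0.804, 1.4507454347811181e-06, 256),
    (5, 0.874, 0.000693695683466865, 256),
    (6, 0.809, 1.0068025147504158e-05, 128),
]

# Sort by accuracy, best first
ranked = sorted(logged_trials, key=lambda trial: trial[1], reverse=True)
for number, accuracy, learning_rate, batch_size in ranked:
    print(f"trial {number}: accuracy={accuracy:.3f}  lr={learning_rate:.2e}  batch_size={batch_size}")

best_logged = ranked[0]
```

In this log the trials with a learning rate of about 2e-4 or higher reach 0.874 to 0.881, while those at about 1e-5 or lower stay between 0.804 and 0.812. Batch size shows no comparable pattern across these seven trials.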

In [None]:
import optuna.visualization as vis
fig = vis.plot_parallel_coordinate(study)
fig.show()

In [None]:
fig = vis.plot_slice(study)
fig.show()

In [None]:
fig = vis.plot_optimization_history(study)
fig.show()

In [None]:
import optuna
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

def objective(trial):
    # Define the search space for hyperparameters
    learning_rate = trial.suggest_float("learning_rate", 1e-6, 1e-3, log=True)
    num_train_epochs = trial.suggest_int("num_train_epochs", 1, 5)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])

    # Load the dataset, tokenizer, and model
    dataset = ...
    tokenizer = AutoTokenizer.from_pretrained("your-pretrained-model")
    model = AutoModelForSequenceClassification.from_pretrained("your-pretrained-model")

    # Set up training arguments
    training_args = TrainingArguments(
        output_dir="./results",
        num_train_epochs=num_train_epochs,
        learning_rate=learning_rate,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        eval_steps=100,
        disable_tqdm=True,
        report_to="none",
    )

    # Set up the trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["validation"],
    )

    # Train the model and get the best validation loss
    trainer.train()
    eval_results = trainer.evaluate()
    return eval_results["eval_loss"]

In [None]:
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)

In [None]:
%pip install --quiet optuna plotly


In [None]:
import optuna.visualization as vis

fig = vis.plot_parallel_coordinate(study)
fig.show()

In [None]:

fig = vis.plot_slice(study)
fig.show()

In [None]:
fig = vis.plot_contour(study, params=['learning_rate', 'num_train_epochs', 'batch_size'])
fig.show()

In [None]:
fig = vis.plot_optimization_history(study)
fig.show()

In [None]:
# Print the best trial's hyperparameters and objective value
best_trial = study.best_trial
print(f"Best trial (number {best_trial.number}):")
print(f"  Value: {best_trial.value}")
print(f"  Params: {best_trial.params}")

# Print all trials' hyperparameters and objective values
print("\nAll trials:")
for trial in study.trials:
    print(f"  Trial {trial.number}:")
    print(f"    Value: {trial.value}")
    print(f"    Params: {trial.params}")

---

## 4. Results and summary

### 4.1 Corpus insights

(Briefly discuss what you learned about the corpus and its annotation)

### 4.2 Results

(Briefly summarize your results)

### 4.3 Relation to state of the art

(Compare your results to the state-of-the-art performance)

---

## 5. Bonus Task (optional)

### 5.1. Annotating out-of-domain documents

(Briefly describe the chosen out-of-domain documents)

(Briefly describe the process of annotation)

### 5.2 Conversion into dataset

In [8]:
# Your code to convert the annotations into a dataset here

### 5.3. Model evaluation on out-of-domain test set

In [9]:
# Your code to evaluate the model on the out-of-domain test set here

### 5.4 Bonus task results

(Present the results of the evaluation on the out-of-domain test set)

### 5.5. Annotated data

In [10]:
# Include your annotated out-of-domain data here