<a href="https://colab.research.google.com/github/simulate111/Introduction-to-Human-Language-Technology/blob/main/course_project_Reza2024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to HLT Project (Template)

- Student(s) Name(s): Mohammadreza Akhtari
- Date: June 2024
- Chosen Corpus: imdb
- Contributions (if group project): -

### Corpus information

- Description of the chosen corpus: Large Movie Review Dataset. This is a dataset for binary sentiment classification. A set of 25,000 highly polar movie reviews for training, and 25,000 for testing.
- Paper(s) and other published materials related to the corpus: Maas, Andrew, et al. "Learning word vectors for sentiment analysis." Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies. 2011.
- State-of-the-art performance (best published results) on this corpus: An F1-score of 77.66% and an accuracy of 77.81% for predicting positive or negative sentiment, according to the 2023 publication "Opinion classification for IMDb review based using naive bayes method" (https://doi.org/10.1063/5.0171628).

---

## 1. Setup

In [17]:
# Your code to install and import libraries etc. here
# Install necessary libraries
!pip install --quiet datasets transformers[torch] evaluate optuna plotly

# Import libraries
from pprint import pprint
import datasets
import torch

---

## 2. Data download and preprocessing

### 2.1. Download the corpus

In [13]:
# Your code to download the corpus here
dataset1 = datasets.load_dataset("imdb")
dataset1

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [14]:
dataset=dataset1.shuffle() #This is never a bad idea, datasets may have ordering to them, which is not what we want
del dataset["unsupervised"] #Delete the unlabeled part of the dataset, we don't need it for anything

In [15]:
pprint(dataset['train'][0]['text'])
print(dataset['train'][0]['label'])

('This programme bugs me! There is no humour to it and is far too serious to '
 'be called "fun"! It\'s just far too educational for my liking! The '
 'characters are very stereotyped and unappealing. The plots are redundant and '
 "the morals are just repeated over and over again. Where's the fun in it? "
 'Also I feel this has been on the BBC for far too long and is broadcast way '
 'too much. Does it really need to have a slot on T.V every 2 or 3 months when '
 "a brand new show runs out of episodes? I think it's time that the BBC "
 'starting bringing back some of their older shows like: Inspector Gadget, '
 'Bananaman, The Smurfs, Snorks, Moomins, the Raccoons and Count Duckula other '
 'than continually giving contracts to these newer shows! I thought the BBC '
 "where bring back Danger Mouse, so what's going on with that?! 3/10")
0


### 2.2. Preprocessing

In [16]:
#Tokenize and map vocabulary
import sklearn.feature_extraction

# max_features caps the size of the vocabulary:
# only the max_features most frequent words are kept
vectorizer=sklearn.feature_extraction.text.CountVectorizer(binary=True,max_features=20000)

texts=[ex["text"] for ex in dataset["train"]] #get a list of all texts from the training data
vectorizer.fit(texts) #"Trains" the vectorizer, i.e. builds its vocabulary

In [18]:
#Building the feature vectors
def vectorize_example(ex):
    vectorized=vectorizer.transform([ex["text"]]) # [...] because the vectorizer expects a list/iterable over inputs, not one input
    non_zero_features=vectorized.nonzero()[1] #.nonzero gives a pair of (rows,columns), we want the columns
    non_zero_features+=1 #feature index 0 will have a special meaning
                         # so let us not produce it by adding +1 to everything
    return {"input_ids":non_zero_features}

vectorized=vectorize_example(dataset["train"][0])

In [19]:
print(vectorized)

{'input_ids': array([   10,   605,   774,   887,  1115,  1495,  1700,  1702,  1756,
        2317,  2405,  2406,  2423,  2514,  2674,  3062,  3975,  4136,
        4552,  5394,  5827,  6181,  6315,  6663,  6765,  7127,  7388,
        7439,  7695,  7777,  8289,  8322,  8826,  9085,  9348,  9602,
        9630,  9890, 10475, 10483, 10628, 11176, 11650, 11675, 11752,
       11778, 11857, 11993, 12060, 12065, 12134, 12363, 12411, 12437,
       12504, 12564, 12577, 12627, 13362, 13827, 14349, 14465, 14682,
       15273, 15791, 16033, 16049, 16357, 16478, 16542, 16924, 16993,
       17885, 17893, 17897, 17907, 17929, 17938, 17954, 17968, 17983,
       18074, 18115, 18166, 18602, 19112, 19440, 19540, 19549, 19551,
       19712], dtype=int32)}


In [20]:
# We can map back to vocabulary and check that everything works
# vectorizer.vocabulary_ is a dictionary {key:word, value:idx}

idx2word=dict((i,w) for (w,i) in vectorizer.vocabulary_.items()) #inverse the vocab dictionary
words=[]
for idx in vectorized["input_ids"]:
    words.append(idx2word[idx-1]) # It is easy to forget that we shifted all indices by +1
pprint(", ".join(words)) #This is now the bag of words representation of the document

('10, again, also, and, are, back, bbc, be, been, brand, bring, bringing, '
 'broadcast, bugs, called, characters, continually, count, danger, does, '
 'educational, episodes, every, far, feel, for, fun, gadget, giving, going, '
 'has, have, humour, in, inspector, is, it, just, like, liking, long, me, '
 'months, morals, mouse, much, my, need, new, newer, no, of, older, on, or, '
 'other, out, over, plots, programme, really, redundant, repeated, runs, '
 'serious, show, shows, slot, so, some, starting, stereotyped, than, that, '
 'the, their, there, these, think, this, thought, time, to, too, unappealing, '
 'very, way, what, when, where, with')
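
The preprocessing above can be mimicked in a few lines of plain Python. This is only a sketch of the idea, not scikit-learn's implementation: the whitespace tokenizer, the `build_vocab` and `encode` helpers and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter

def build_vocab(texts, max_features):
    """Map the max_features most frequent tokens to indices 0..k-1 (alphabetical order)."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    most_common = [tok for tok, _ in counts.most_common(max_features)]
    return {tok: i for i, tok in enumerate(sorted(most_common))}

def encode(text, vocab):
    """Binary bag of words: sorted unique vocabulary indices, shifted by +1 so 0 stays free for padding."""
    present = {vocab[tok] for tok in text.lower().split() if tok in vocab}
    return [i + 1 for i in sorted(present)]

texts = ["good good movie", "bad movie"]  # toy corpus (hypothetical)
vocab = build_vocab(texts, max_features=3)
ids = encode("good good movie", vocab)

# Map back to words; remember to undo the +1 shift
idx2word = {i: w for w, i in vocab.items()}
decoded = [idx2word[i - 1] for i in ids]
print(ids, decoded)
```

Note that the repeated word "good" appears only once in `ids`, which is the effect of `binary=True`: the representation records presence, not counts.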


In [25]:
# Apply the tokenizer to the whole dataset using .map()
dataset_tokenized = dataset.map(vectorize_example,num_proc=4)
pprint(dataset_tokenized["train"][0])

{'input_ids': [10,
               605,
               774,
               887,
               1115,
               1495,
               1700,
               1702,
               1756,
               2317,
               2405,
               2406,
               2423,
               2514,
               2674,
               3062,
               3975,
               4136,
               4552,
               5394,
               5827,
               6181,
               6315,
               6663,
               6765,
               7127,
               7388,
               7439,
               7695,
               7777,
               8289,
               8322,
               8826,
               9085,
               9348,
               9602,
               9630,
               9890,
               10475,
               10483,
               10628,
               11176,
               11650,
               11675,
               11752,
               11778,
               11857,
         

In [29]:
#Padding and Collation (forming a batch)
def collator(list_of_examples):
    #this is easy, labels are made into a single tensor
    batch={"labels":torch.tensor(list(ex["label"] for ex in list_of_examples))}
    #the worse bit is now to pad the examples, as they are of different length
    tensors=[]
    max_len=max(len(example["input_ids"]) for example in list_of_examples) #this is the longest example in the batch
    #everything needs to be padded to fit in length the longest example
    #(so we can build a single tensor out of it)
    for example in list_of_examples:
        ids=torch.tensor(example["input_ids"]) #pick the input ids
        # pad(what,(from_left, from_right)) <- this is how we call the stock pad function
        padded=torch.nn.functional.pad(ids,(0,max_len-ids.shape[0])) #pad by max - current length, pads with zero by default
        tensors.append(padded) #accumulate the padded ids
    batch["input_ids"]=torch.vstack(tensors) #now that we have all of them the same length, a simple vstack() stacks them up
    return batch #...and that's all there is to it

In [28]:
#Build a batch from 2 examples, with padding
batch=collator([dataset_tokenized["train"][2],dataset_tokenized["train"][7]])
print("Shape of labels:",batch["labels"].shape)
print("Shape of input_ids:",batch["input_ids"].shape)
pprint(batch["labels"])
pprint(batch["input_ids"])

Shape of labels: torch.Size([2])
Shape of input_ids: torch.Size([2, 106])
tensor([1, 1])
tensor([[  436,   684,   727,   793,   825,   887,  1300,  1371,  1702,  1987,
          2604,  2792,  2997,  3599,  4042,  4254,  4328,  4424,  4521,  4629,
          5068,  5104,  5405,  5442,  6100,  6315,  6851,  6939,  7127,  7133,
          7346,  7404,  7694,  8476,  8553,  8929,  9085,  9485,  9563,  9602,
          9630, 10037, 10241, 10576, 11233, 11376, 11765, 12055, 12060, 12202,
         12212, 12363, 12364, 12439, 13051, 13100, 13160, 13211, 13474, 13713,
         14131, 14277, 14310, 14349, 14631, 15194, 15390, 15502, 15755, 15847,
         15919, 16493, 17811, 17897, 17968, 18000, 18074, 18115, 18167, 18225,
         19414, 19509, 19639, 19716, 19923, 19927,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0],
        [  887,   909,   959,  1157,  1300,  1702,  1732,  2136,  260
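
Stripped of the tensor handling, the padding step of the collator amounts to the following. This is a minimal sketch on plain lists; `pad_batch` is a hypothetical helper, and the two short sequences only reuse the first few ids of the examples above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence with pad_id to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Two sequences of different length; the shorter one gets one trailing 0
batch = pad_batch([[436, 684, 727], [887, 909]])
print(batch)
```

Because the padding length is computed per batch, a batch of short reviews stays short; nothing is padded to a global maximum.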

---

## 3. Machine learning model

### 3.1. Model training

In [30]:
#Build the MLP model
import torch
import transformers

# A model wants a config, I can simply inherit from the base
# class for pretrained configs
class MLPConfig(transformers.PretrainedConfig):
    pass

# This is the model
class MLP(transformers.PreTrainedModel):

    config_class=MLPConfig

    # In the initialization method, one instantiates the layers
    # these will be, for the most part the trained parameters of the model
    def __init__(self,config):
        super().__init__(config)
        self.vocab_size=config.vocab_size #embedding matrix row count
        # Build and initialize embedding of vocab size +1 x hidden size (+1 because of the padding index 0!)
        self.embedding=torch.nn.Embedding(num_embeddings=self.vocab_size+1,embedding_dim=config.hidden_size,padding_idx=0)
        # Normally you would not initialize these yourself, but I have my reasons here ;)
        torch.nn.init.uniform_(self.embedding.weight.data,-0.001,0.001) #initialize the embeddings with small random values
        # Note! uniform_() overwrites every row, including the padding row 0, so zero it again;
        # padding_idx=0 then keeps it at zero because that row receives no gradient updates
        self.embedding.weight.data[self.embedding.padding_idx].zero_()
        # This takes care of the lower half of the network, now the upper half
        # Output layer: hidden size x output size
        self.output=torch.nn.Linear(in_features=config.hidden_size,out_features=config.nlabels)
        # Now we have the parameters of the model


    # The computation of the model is put into the forward() function
    # it receives a batch of data and optionally the correct `labels`
    #
    # If given `labels` it returns (loss,output)
    # if not, then it returns (output,)
    def forward(self,input_ids,labels=None):
        #1) sum up the embeddings of the items
        embedded=self.embedding(input_ids) #(batch,ids)->(batch,ids,embedding_dim)
        # Since the Embedding keeps the first row of the matrix pure zeros, we don't need to worry about the padding
        # so next we sum the embeddings across the word dimension
        # (batch,ids,embedding_dim) -> (batch,embedding_dim)
        embedded_summed=torch.sum(embedded,dim=1)

        #2) apply non-linearity
        # (batch,embedding_dim) -> (batch,embedding_dim)
        projected=torch.tanh(embedded_summed) #Note how non-linearity is applied here and not when configuring the layer in __init__()

        #3) and now apply the upper, output layer of the network
        # (batch,embedding_dim) -> (batch, num_of_classes i.e. 2 in our case)
        logits=self.output(projected)

        # ...and that's all there is to it!

        #print("input_ids.shape",input_ids.shape)
        #print("embedded.shape",embedded.shape)
        #print("embedded_summed.shape",embedded_summed.shape)
        #print("projected.shape",projected.shape)
        #print("logits.shape",logits.shape)

        # If we have labels, we ought to calculate the loss
        if labels is not None:
            loss=torch.nn.CrossEntropyLoss() #This loss is meant for classification, so let's use it
            # You run it as loss(model_output,correct_labels)
            return (loss(logits,labels),logits)
        else:
            # No labels, so just return the logits
            return (logits,)
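
The sum in `forward()` is only independent of the amount of padding if the embedding row for index 0 is all zeros. The sketch below illustrates this with plain Python lists instead of tensors; the `summed_tanh` helper and the toy embedding values are assumptions chosen for illustration.

```python
import math

def summed_tanh(ids, table):
    """Sum the embedding rows selected by ids, then apply tanh element-wise."""
    dim = len(table[0])
    summed = [sum(table[i][d] for i in ids) for d in range(dim)]
    return [math.tanh(x) for x in summed]

# Toy embedding table (hypothetical values); row 0 is the padding row
table = [[0.0, 0.0], [0.1, -0.2], [0.3, 0.4]]

unpadded = summed_tanh([1, 2], table)
padded = summed_tanh([1, 2, 0, 0], table)  # same example inside a longer batch

# With a non-zero padding row, the padding leaks into the representation
table[0] = [0.001, -0.001]
leaky = summed_tanh([1, 2, 0, 0], table)

print(unpadded == padded, leaky != unpadded)
```

With a zero padding row the same review yields the same hidden representation regardless of which batch it lands in; with a non-zero row the result would depend on the length of the longest review in the batch.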


In [32]:
# Configure the model:
#   these parameters are used in the model's __init__()
mlp_config=MLPConfig(vocab_size=len(vectorizer.vocabulary_),hidden_size=20,nlabels=2)

# And now we can instantiate it
mlp=MLP(mlp_config)

#we can make a little test with a fake batch formed from the first two examples
fake_batch=collator([dataset_tokenized["train"][0],dataset_tokenized["train"][1]])
mlp(**fake_batch) #** expands input_ids and labels as parameters of the call

(tensor(0.7343, grad_fn=<NllLossBackward0>),
 tensor([[-0.1822, -0.1145],
         [-0.2030, -0.1093]], grad_fn=<AddmmBackward0>))
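
An untrained two-class model that assigns equal probability to both classes has a cross-entropy of ln 2 ≈ 0.693, so an initial loss of 0.7343 is about what one would expect. The sketch below recomputes the loss by hand from the printed logits. The labels of the fake batch are not printed, so `labels = [0, 0]` is an assumption; under that assumption the mean comes out close to the reported value.

```python
import math

def cross_entropy(logits, label):
    """Negative log-softmax of the correct class for a single example."""
    log_norm = math.log(sum(math.exp(z) for z in logits))
    return log_norm - logits[label]

logits = [[-0.1822, -0.1145], [-0.2030, -0.1093]]  # as printed above (rounded to 4 decimals)
labels = [0, 0]  # assumption: both examples are negative reviews

# CrossEntropyLoss averages over the batch by default
loss = sum(cross_entropy(z, y) for z, y in zip(logits, labels)) / len(labels)
print(round(loss, 4), round(math.log(2), 4))
```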

In [33]:
#Train the model
# Set training arguments
# their names are mostly self-explanatory

# Wrapping the arguments in a function makes it easy to loop over
# various learning rates and batch sizes
def training_arguments(learning_rate=1e-5, batch_size=128):
  trainer_args = transformers.TrainingArguments(
      "mlp_checkpoints", #save checkpoints here
      evaluation_strategy="steps",
      logging_strategy="steps",
      eval_steps=500,
      logging_steps=500,
      learning_rate=learning_rate, #learning rate of the gradient descent
      max_steps=20000,
      load_best_model_at_end=True,
      per_device_train_batch_size=batch_size
  )

  pprint(trainer_args)
  return trainer_args

In [34]:
# A metric for evaluating performance
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = np.argmax(outputs, axis=-1) #pick the index of the "winning" label
    return accuracy.compute(predictions=predictions, references=labels)
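
What `compute_accuracy` does can be written out without NumPy or the `evaluate` library. This is a rough sketch only; `argmax_accuracy` is a hypothetical helper and the logits are made-up values.

```python
def argmax_accuracy(logits, labels):
    """Pick the highest-scoring class per row and compare with the reference labels."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

toy_logits = [[0.2, -0.1], [-0.3, 0.4], [0.5, 0.1]]  # hypothetical model outputs
toy_labels = [0, 1, 1]
result = argmax_accuracy(toy_logits, toy_labels)
print(result)
```

The logits do not need to be turned into probabilities first: softmax is monotonic, so the argmax of the logits is also the argmax of the probabilities.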

In [None]:
import transformers
from pprint import pprint

# Define training arguments
trainer_args = transformers.TrainingArguments(
    "mlp_checkpoints",            # Directory to save checkpoints/models
    evaluation_strategy="steps",  # Evaluate every `eval_steps` steps
    logging_strategy="steps",     # Log every `logging_steps` steps
    eval_steps=500,               # Evaluate every 500 steps
    logging_steps=500,            # Log every 500 steps
    learning_rate=1e-4,           # Learning rate for gradient descent
    max_steps=20000,              # Maximum number of training steps
    load_best_model_at_end=True,  # Load the best model at the end of training
    per_device_train_batch_size=128  # Batch size per GPU/device
)

pprint(vars(trainer_args))  # Print all arguments as a dictionary


### 3.2 Hyperparameter optimization

In [13]:
# Your code for hyperparameter optimization here


### 3.3. Evaluation on test set

In [14]:
# Your code to evaluate the final model on the test set here
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = np.argmax(outputs, axis=-1) #pick the index of the "winning" label
    return accuracy.compute(predictions=predictions, references=labels)


In [15]:
# Make a new model
mlp = MLP(mlp_config)


# The argument is the patience: training stops once the evaluation loss
# has failed to improve for this many consecutive evaluations
# (here 5 evaluations, i.e. 2500 steps with eval_steps=500)
early_stopping = transformers.EarlyStoppingCallback(early_stopping_patience=5)

trainer = transformers.Trainer(
    model=mlp,
    args=trainer_args,
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["test"].select(range(1000)), #make a smaller subset to evaluate on
    compute_metrics=compute_accuracy,
    data_collator=collator,
    callbacks=[early_stopping]
)

# FINALLY!
trainer.train()

max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.6903,0.666271,0.609
1000,0.6235,0.595945,0.768
1500,0.5399,0.524087,0.821
2000,0.4651,0.46852,0.834
2500,0.4087,0.428132,0.848
3000,0.3651,0.397365,0.852
3500,0.3319,0.373937,0.858
4000,0.3064,0.355893,0.859
4500,0.2839,0.34126,0.862
5000,0.2663,0.329236,0.866


TrainOutput(global_step=15500, training_loss=0.25699089050292967, metrics={'train_runtime': 454.2342, 'train_samples_per_second': 5635.86, 'train_steps_per_second': 44.03, 'total_flos': 20612312640.0, 'train_loss': 0.25699089050292967, 'epoch': 79.08163265306122})
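
The `epoch` value in the `TrainOutput` can be reproduced from the step count. Training ended at step 15500 rather than at `max_steps=20000`, which suggests that early stopping ended the run. The arithmetic below assumes a single device and that the final, smaller batch of each epoch counts as a full step.

```python
import math

train_examples = 25_000  # size of the IMDb training split
batch_size = 128         # per_device_train_batch_size, assuming one device
global_step = 15_500     # from the TrainOutput above

# 25000 / 128 is not an integer, so the last batch of each epoch is smaller
steps_per_epoch = math.ceil(train_examples / batch_size)
epochs = global_step / steps_per_epoch
print(steps_per_epoch, epochs)
```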

In [16]:
eval_results = trainer.evaluate(dataset_tokenized["test"])

print(eval_results)

{'eval_loss': 0.27946680784225464, 'eval_accuracy': 0.88808, 'eval_runtime': 8.2132, 'eval_samples_per_second': 3043.871, 'eval_steps_per_second': 380.484, 'epoch': 79.08163265306122}


In [17]:
import optuna

def objective(trial):
    # Define the search space for hyperparameters
    learning_rate = trial.suggest_float("learning_rate", 1e-6, 1e-3, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 64, 128, 256])


    trainer_args = transformers.TrainingArguments(
        "mlp_checkpoints", #save checkpoints here
        evaluation_strategy="steps",
        logging_strategy="steps",
        eval_steps=500,
        logging_steps=500,
        learning_rate=learning_rate, #learning rate of the gradient descent
        max_steps=10000, #I will keep this small, no time to wait too long :)
        load_best_model_at_end=True,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size
    )

    mlp = MLP(mlp_config)

    trainer = transformers.Trainer(
        model=mlp,
        args=trainer_args,
        train_dataset=dataset_tokenized["train"],
        eval_dataset=dataset_tokenized["test"].select(range(1000)), #make a smaller subset to evaluate on
        compute_metrics=compute_accuracy,
        data_collator=collator,
        callbacks=[early_stopping]
    )

    # Train the model, then evaluate it on the 1000-example evaluation subset
    trainer.train()
    eval_results = trainer.evaluate()
    return eval_results["eval_accuracy"] #let's try to maximize accuracy

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=7) #I will keep this small, no time to wait :)

[I 2024-06-16 05:51:20,446] A new study created in memory with name: no-name-b600b836-a420-4e06-b7ad-b6dba517824d
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.5306,0.43422,0.842
1000,0.3274,0.343792,0.868
1500,0.2452,0.308511,0.884
2000,0.2009,0.292937,0.882
2500,0.1699,0.287313,0.881
3000,0.1468,0.285695,0.877
3500,0.1294,0.285942,0.879
4000,0.1147,0.292289,0.876
4500,0.1036,0.298928,0.877
5000,0.0933,0.302331,0.876


[I 2024-06-16 05:53:46,370] Trial 0 finished with value: 0.877 and parameters: {'learning_rate': 0.0003236332328108065, 'batch_size': 128}. Best is trial 0 with value: 0.877.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.6865,0.680071,0.676
1000,0.6717,0.668126,0.766
1500,0.658,0.656776,0.789
2000,0.6451,0.64629,0.796
2500,0.6333,0.636532,0.802
3000,0.6223,0.627584,0.807
3500,0.6123,0.619411,0.807
4000,0.6032,0.611966,0.811
4500,0.595,0.60524,0.812
5000,0.5875,0.599262,0.812


[I 2024-06-16 06:01:39,136] Trial 1 finished with value: 0.816 and parameters: {'learning_rate': 1.0531766239541161e-05, 'batch_size': 256}. Best is trial 0 with value: 0.877.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.5691,0.434823,0.841
1000,0.3139,0.327423,0.869
1500,0.2233,0.295354,0.881
2000,0.1783,0.284937,0.885
2500,0.1469,0.284614,0.88
3000,0.1237,0.289808,0.88
3500,0.1063,0.295557,0.879
4000,0.0914,0.307313,0.882
4500,0.0806,0.322199,0.88
5000,0.0702,0.330465,0.879


[I 2024-06-16 06:03:52,932] Trial 2 finished with value: 0.88 and parameters: {'learning_rate': 0.00047558359929035857, 'batch_size': 128}. Best is trial 2 with value: 0.88.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.3928,0.316103,0.873
1000,0.2133,0.28403,0.88
1500,0.1521,0.287761,0.887
2000,0.1141,0.308302,0.879
2500,0.0856,0.325856,0.881
3000,0.0668,0.349655,0.872
3500,0.0533,0.375826,0.872


[I 2024-06-16 06:04:49,633] Trial 3 finished with value: 0.88 and parameters: {'learning_rate': 0.0008237555164262499, 'batch_size': 64}. Best is trial 2 with value: 0.88.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.7777,0.700696,0.535
1000,0.681,0.658331,0.59
1500,0.6363,0.625271,0.645
2000,0.601,0.59752,0.685
2500,0.5714,0.574645,0.724
3000,0.546,0.554697,0.746
3500,0.5239,0.538459,0.775
4000,0.5071,0.524693,0.791
4500,0.4908,0.512659,0.807
5000,0.4788,0.502397,0.809


[I 2024-06-16 06:09:20,959] Trial 4 finished with value: 0.825 and parameters: {'learning_rate': 2.2680194815903982e-05, 'batch_size': 128}. Best is trial 2 with value: 0.88.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.6682,0.647515,0.818
1000,0.6201,0.609406,0.814
1500,0.5796,0.577205,0.828
2000,0.547,0.552099,0.831
2500,0.5191,0.530915,0.829
3000,0.4973,0.512558,0.837
3500,0.4773,0.497029,0.838
4000,0.4608,0.484455,0.839
4500,0.4476,0.473588,0.84
5000,0.4353,0.464669,0.841


[I 2024-06-16 06:12:02,600] Trial 5 finished with value: 0.853 and parameters: {'learning_rate': 2.6004179785978694e-05, 'batch_size': 64}. Best is trial 2 with value: 0.88.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.7172,0.722806,0.48
1000,0.7118,0.717479,0.48
1500,0.7064,0.713326,0.48
2000,0.7067,0.709311,0.48
2500,0.6962,0.706442,0.48
3000,0.6993,0.703626,0.48
3500,0.6935,0.701344,0.481
4000,0.6931,0.699202,0.484
4500,0.6919,0.697173,0.49
5000,0.6914,0.695515,0.494


[I 2024-06-16 06:13:22,768] Trial 6 finished with value: 0.516 and parameters: {'learning_rate': 3.6001326855292456e-06, 'batch_size': 16}. Best is trial 2 with value: 0.88.
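
Trial 3 (learning rate ≈ 8.2e-4, batch size 64) is a good illustration of early stopping: its log ends at step 3500 instead of running to `max_steps`. The sketch below replays its validation losses through a simplified patience counter. It is an approximation of the idea and not the code of `EarlyStoppingCallback`; `early_stop_step` is a hypothetical helper.

```python
def early_stop_step(val_losses, eval_every=500, patience=5):
    """Return the step at which training would stop, or None if patience never runs out."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, bad_evals = loss, 0  # improvement resets the counter
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i * eval_every
    return None

# Validation losses of trial 3, copied from the log above
trial_3_losses = [0.316103, 0.28403, 0.287761, 0.308302, 0.325856, 0.349655, 0.375826]
stop_step = early_stop_step(trial_3_losses)
print(stop_step)
```

The best loss is reached at step 1000; the five following evaluations are all worse, so patience runs out at step 3500. Since `load_best_model_at_end=True` is set, the model reported for the trial is the checkpoint with the lowest validation loss, not the last one.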


In [18]:
# Print the best trial's hyperparameters and objective value
best_trial = study.best_trial
print(f"Best trial (number {best_trial.number}):")
print(f"  Value: {best_trial.value}")
print(f"  Params: {best_trial.params}")

# Print all trials' hyperparameters and objective values
print("\nAll trials:")
for trial in study.trials:
    print(f"  Trial {trial.number}:")
    print(f"    Value: {trial.value}")
    print(f"    Params: {trial.params}")

Best trial (number 2):
  Value: 0.88
  Params: {'learning_rate': 0.00047558359929035857, 'batch_size': 128}

All trials:
  Trial 0:
    Value: 0.877
    Params: {'learning_rate': 0.0003236332328108065, 'batch_size': 128}
  Trial 1:
    Value: 0.816
    Params: {'learning_rate': 1.0531766239541161e-05, 'batch_size': 256}
  Trial 2:
    Value: 0.88
    Params: {'learning_rate': 0.00047558359929035857, 'batch_size': 128}
  Trial 3:
    Value: 0.88
    Params: {'learning_rate': 0.0008237555164262499, 'batch_size': 64}
  Trial 4:
    Value: 0.825
    Params: {'learning_rate': 2.2680194815903982e-05, 'batch_size': 128}
  Trial 5:
    Value: 0.853
    Params: {'learning_rate': 2.6004179785978694e-05, 'batch_size': 64}
  Trial 6:
    Value: 0.516
    Params: {'learning_rate': 3.6001326855292456e-06, 'batch_size': 16}
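
The sampled learning rates span several orders of magnitude because the search space is declared with `log=True`. The idea behind log-uniform sampling can be sketched with the standard library as below. This shows the distribution only and is not necessarily how Optuna's sampler draws its values; `sample_log_uniform` and the seed are assumptions.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw uniformly in log space, then map back with exp."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)  # fixed seed for reproducibility
samples = [sample_log_uniform(1e-6, 1e-3, rng) for _ in range(1000)]

# In log space the midpoint of [1e-6, 1e-3] is the geometric mean, about 3.2e-5
geo_mean = math.sqrt(1e-6 * 1e-3)
below_geo_mean = sum(s < geo_mean for s in samples) / len(samples)
print(min(samples), max(samples), below_geo_mean)
```

Roughly half of the samples fall below 3.2e-5. With plain uniform sampling on the same interval, about 97% of the samples would lie above 3.2e-5 and the small learning rates would hardly be explored.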


In [19]:
import optuna.visualization as vis
fig = vis.plot_parallel_coordinate(study)
fig.show()

In [20]:
fig = vis.plot_slice(study)
fig.show()

In [21]:
fig = vis.plot_optimization_history(study)
fig.show()

In [22]:
import optuna
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

def objective(trial):
    # Define the search space for hyperparameters
    learning_rate = trial.suggest_float("learning_rate", 1e-6, 1e-3, log=True)
    num_train_epochs = trial.suggest_int("num_train_epochs", 1, 5)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])

    # Load the dataset, tokenizer, and model
    dataset = ...
    tokenizer = AutoTokenizer.from_pretrained("your-pretrained-model")
    model = AutoModelForSequenceClassification.from_pretrained("your-pretrained-model")

    # Set up training arguments
    training_args = TrainingArguments(
        output_dir="./results",
        num_train_epochs=num_train_epochs,
        learning_rate=learning_rate,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        eval_steps=100,
        disable_tqdm=True,
        report_to="none",
    )

    # Set up the trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["validation"],
    )

    # Train the model and get the best validation loss
    trainer.train()
    eval_results = trainer.evaluate()
    return eval_results["eval_loss"]

In [23]:
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)

[I 2024-06-16 06:13:24,035] A new study created in memory with name: no-name-bcebb09f-4d84-497c-8130-4641049d6e35
[W 2024-06-16 06:13:24,148] Trial 0 failed with parameters: {'learning_rate': 3.455030671026054e-05, 'num_train_epochs': 1, 'batch_size': 8} because of the following error: OSError("your-pretrained-model is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'\nIf this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`").
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url

OSError: your-pretrained-model is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`

In [None]:
%pip install --quiet optuna plotly


In [None]:
import optuna.visualization as vis

fig = vis.plot_parallel_coordinate(study)
fig.show()

In [None]:

fig = vis.plot_slice(study)
fig.show()

In [None]:
fig = vis.plot_contour(study, params=['learning_rate', 'num_train_epochs', 'batch_size'])
fig.show()

In [None]:
fig = vis.plot_optimization_history(study)
fig.show()

In [None]:
# Print the best trial's hyperparameters and objective value
best_trial = study.best_trial
print(f"Best trial (number {best_trial.number}):")
print(f"  Value: {best_trial.value}")
print(f"  Params: {best_trial.params}")

# Print all trials' hyperparameters and objective values
print("\nAll trials:")
for trial in study.trials:
    print(f"  Trial {trial.number}:")
    print(f"    Value: {trial.value}")
    print(f"    Params: {trial.params}")

---

## 4. Results and summary

### 4.1 Corpus insights

(Briefly discuss what you learned about the corpus and its annotation)

### 4.2 Results

(Briefly summarize your results)

### 4.3 Relation to state of the art

(Compare your results to the state-of-the-art performance)

---

## 5. Bonus Task (optional)

### 5.1. Annotating out-of-domain documents

(Briefly describe the chosen out-of-domain documents)

(Briefly describe the process of annotation)

### 5.2 Conversion into dataset

In [None]:
# Your code to convert the annotations into a dataset here

### 5.3. Model evaluation on out-of-domain test set

In [None]:
# Your code to evaluate the model on the out-of-domain test set here

### 5.4 Bonus task results

(Present the results of the evaluation on the out-of-domain test set)

### 5.5. Annotated data

In [None]:
# Include your annotated out-of-domain data here