<a href="https://colab.research.google.com/github/simulate111/Introduction-to-Human-Language-Technology/blob/main/course_project_Reza2024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to HLT Project

- Student(s) Name(s): Mohammadreza Akhtari
- Date: June 2024
- Chosen Corpus: imdb
- Contributions (if group project): -

### Corpus information

- Description of the chosen corpus: Large Movie Review Dataset (IMDB), a dataset for binary sentiment classification with 25,000 highly polar movie reviews for training and 25,000 for testing.
- Paper(s) and other published materials related to the corpus: Maas, Andrew, et al. "Learning Word Vectors for Sentiment Analysis." Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 2011.
- State-of-the-art performance (best published results) on this corpus: 93.74% accuracy with XLNet, reported in "Opinion Mining on Movie Reviews Based on Deep Learning Models", https://doi.org/10.32604/jai.2023.045617

---

## 1. Setup

In [1]:
# Your code to install and import libraries etc. here
!pip3 install -q transformers[torch] datasets evaluate optuna plotly

---

## 2. Data download and preprocessing

### 2.1. Download the corpus

In [2]:
# Your code to download the corpus here
from pprint import pprint #pprint
import datasets
dset1=datasets.load_dataset("imdb")
pprint(dset1)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


In [3]:
dset2=dset1.shuffle() #This is never a bad idea, datasets may have ordering to them, which is not what we want
del dset2["unsupervised"] #Delete the unlabeled part of the dataset, we don't need it for anything

In [4]:
pprint(dset2['train'][0]['text'])
print(dset2['train'][0]['label'])

('After reading the original play I thought it would have been much more '
 'difficult to adapt to screen than it turned out to be. Donal McCann puts in '
 'a once-off great performance as Public Gar, the repressed antagonist who is '
 'manifested openly on screen by his extroverted (but unseen to others) '
 'alterego- Private Gar. Eamonn Kelly also plays an excellent "screwballs" '
 'whose inability to communicate his feelings is matched only by Gar.<br /><br '
 '/>Definitely worth renting out if you can find it. (Probably unavailable '
 'outside Ireland & UK)')
1


### 2.2. Preprocessing
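
The sample review printed above contains HTML line breaks (`<br />`), so a light cleaning step is one option before vectorizing. The following is a minimal sketch using only the standard library, not the preprocessing actually applied in this notebook. The names `preprocess_text` and `preprocess_example`, the placeholder word `urltoken`, and the sample string are assumptions made for illustration. HTML tags are stripped before URLs are masked so that the placeholder cannot be mistaken for a tag.

```python
import re

URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"<[^>]+>")
NON_LETTER_RE = re.compile(r"[^a-z\s]")

def preprocess_text(text):
    """Lowercase, strip HTML tags, mask URLs, and drop non-letter characters."""
    text = text.lower()
    # Strip HTML first, replacing each tag by a space so words do not fuse.
    text = TAG_RE.sub(" ", text)
    # The placeholder is a plain word (hypothetical choice) so that it
    # survives the non-letter filter below.
    text = URL_RE.sub(" urltoken ", text)
    text = NON_LETTER_RE.sub("", text)
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())

def preprocess_example(example):
    """Take one example dict and return only the updated column."""
    return {"text": preprocess_text(example["text"])}

sample = "Matched only by Gar.<br /><br />Definitely worth renting, see https://example.com/x now!"
print(preprocess_text(sample))
```

Because `preprocess_example` takes a dict and returns a dict, it has the shape that the `.map()` method of a Hugging Face dataset accepts, which would keep the `train`/`test` structure that the later cells index into. Stopword removal is left out here on purpose.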

In [5]:
dset=dset2
'''
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove HTML tags (if any) first, so the URL placeholder is not stripped as a tag
    text = re.sub(r'<.*?>', ' ', text)

    # Replace URLs with a placeholder word
    text = re.sub(r'https?://\S+', ' url ', text)

    # Remove non-alphabetic characters except whitespace
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Tokenize text
    tokens = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Join tokens back into a single string
    text = ' '.join(tokens)

    return text

def clean_example(example):
    # Return only the updated column; .map() keeps 'label' unchanged
    return {'text': preprocess_text(example['text'])}

# Apply to every split; .map() preserves the DatasetDict structure
# that the later cells index with dset["train"] and dset["test"]
dset = dset2.map(clean_example)
'''

In [6]:
#Tokenize and map vocabulary
import sklearn.feature_extraction

vectorizer=sklearn.feature_extraction.text.CountVectorizer(binary=True,max_features=20000)

texts=[ex["text"] for ex in dset["train"]] #get a list of all texts from the training data
vectorizer.fit(texts) #"Trains" the vectorizer, i.e. builds its vocabulary

In [7]:
#Building the feature vectors
def vectorize_example(ex):
    vectorized=vectorizer.transform([ex["text"]]) # [...] because the vectorizer expects a list/iterable over inputs, not one input
    non_zero_features=vectorized.nonzero()[1] #.nonzero gives a pair of (rows,columns), we want the columns
    non_zero_features+=1 #feature index 0 will have a special meaning
                         # so let us not produce it by adding +1 to everything
    return {"input_ids":non_zero_features}

vectorized=vectorize_example(dset["train"][0])

In [8]:
print(vectorized)

{'input_ids': array([  448,   598,   775,   863,   969,  1209,  1703,  1757,  2300,
        2612,  2633,  2717,  3659,  4760,  5110,  6372,  6773,  6911,
        7910,  8311,  8582,  8921,  9084,  9085,  9568,  9595,  9621,
        9948, 10927, 11082, 11657, 11763, 12349, 12421, 12422, 12429,
       12449, 12528, 12555, 12567, 12605, 13048, 13315, 13325, 13761,
       13774, 13976, 14061, 14321, 14668, 14717, 15575, 17879, 17891,
       17976, 18112, 18520, 18579, 18605, 18877, 19593, 19610, 19799,
       19803, 19926], dtype=int32)}


In [9]:
# We can map back to vocabulary and check that everything works
# vectorizer.vocabulary_ is a dictionary {key:word, value:idx}

idx2word=dict((i,w) for (w,i) in vectorizer.vocabulary_.items()) #inverse the vocab dictionary
words=[]
for idx in vectorized["input_ids"]:
    words.append(idx2word[idx-1]) ## It is easy to forget we moved all by +1
pprint(", ".join(words)) #This is now the bag of words representation of the document

('adapt, after, also, an, antagonist, as, be, been, br, but, by, can, '
 'communicate, definitely, difficult, excellent, feelings, find, great, have, '
 'his, if, in, inability, ireland, is, it, kelly, manifested, matched, more, '
 'much, off, on, once, only, openly, original, others, out, outside, '
 'performance, play, plays, private, probably, public, puts, reading, renting, '
 'repressed, screen, than, the, thought, to, turned, uk, unavailable, unseen, '
 'who, whose, worth, would, you')
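
The indexing convention used above (a sorted vocabulary, binary presence features, and indices shifted by +1 so that 0 stays free for padding) can be illustrated without scikit-learn. This is a simplified plain-Python sketch, not `CountVectorizer` itself: it mimics the default token pattern of two or more word characters but ignores `max_features`, and all function names and the toy texts are assumptions.

```python
import re

def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def build_vocabulary(texts):
    """Map each distinct token to an index, in sorted order."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def encode(text, vocabulary):
    """Return sorted, +1-shifted indices of the vocabulary tokens present in text."""
    present = {vocabulary[tok] for tok in tokenize(text) if tok in vocabulary}
    return [idx + 1 for idx in sorted(present)]

def decode(input_ids, vocabulary):
    """Undo the +1 shift and map indices back to tokens."""
    idx2word = {idx: tok for tok, idx in vocabulary.items()}
    return [idx2word[i - 1] for i in input_ids]

toy_texts = ["A great great film", "not a great plot"]
vocab = build_vocabulary(toy_texts)
ids = encode("great plot, great cast", vocab)
print(vocab)
print(ids, decode(ids, vocab))
```

Repeated words count once and unknown words such as `cast` are dropped, which is why word order and frequency are lost in this representation.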


In [10]:
# Apply the tokenizer to the whole dataset using .map()
dset_tokenized = dset.map(vectorize_example,num_proc=4)
pprint(dset_tokenized["train"][0])

Map (num_proc=4):   0%|          | 0/25000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/25000 [00:00<?, ? examples/s]

{'input_ids': [448,
               598,
               775,
               863,
               969,
               1209,
               1703,
               1757,
               2300,
               2612,
               2633,
               2717,
               3659,
               4760,
               5110,
               6372,
               6773,
               6911,
               7910,
               8311,
               8582,
               8921,
               9084,
               9085,
               9568,
               9595,
               9621,
               9948,
               10927,
               11082,
               11657,
               11763,
               12349,
               12421,
               12422,
               12429,
               12449,
               12528,
               12555,
               12567,
               12605,
               13048,
               13315,
               13325,
               13761,
               13774,
               13976,

In [11]:
#Padding and Collation (forming a batch)
import torch

def collator(list_of_examples):
    batch={"labels":torch.tensor(list(ex["label"] for ex in list_of_examples))} #this is easy, labels are made into a single tensor
    #the worse bit is now to pad the examples, as they are of different length
    tensors=[]
    max_len=max(len(example["input_ids"]) for example in list_of_examples) #this is the longest example in the batch
    #everything needs to be padded to fit in length the longest example
    #(so we can build a single tensor out of it)
    for example in list_of_examples:
        ids=torch.tensor(example["input_ids"]) #pick the input ids
        # pad(what,(from_left, from_right)) <- this is how we call the stock pad function
        padded=torch.nn.functional.pad(ids,(0,max_len-ids.shape[0])) #pad by max - current length, pads with zero by default
        tensors.append(padded) #accumulated the padded ids
    batch["input_ids"]=torch.vstack(tensors) #now that we have all of them the same length, a simple vstack() stacks them up
    return batch #...and that's all there is to it

In [12]:
#Build a batch from 2 examples, with padding
batch=collator([dset_tokenized["train"][2],dset_tokenized["train"][7]])
print("Shape of labels:",batch["labels"].shape)
print("Shape of input_ids:",batch["input_ids"].shape)
pprint(batch["labels"])
pprint(batch["input_ids"])

Shape of labels: torch.Size([2])
Shape of input_ids: torch.Size([2, 185])
tensor([0, 0])
tensor([[  731,   891,   940,   957,  1209,  1226,  1757,  1795,  1988,  2300,
          2649,  2886,  3046,  3067,  3076,  3593,  4138,  4280,  4381,  5409,
          5437,  5707,  6066,  6320,  6880,  6884,  6894,  6947,  7094,  7139,
          7491,  7531,  7684,  7783,  8032,  8213,  8294,  8311,  8493,  8557,
          9084,  9386,  9401,  9485,  9595,  9621,  9629,  9875,  9876,  9979,
          9999, 10219, 10224, 10368, 10453, 10554, 10610, 10756, 10817, 10958,
         11128, 11164, 11701, 11746, 11749, 11800, 11844, 11852, 12086, 12348,
         12421, 12423, 12555, 12856, 12905, 13209, 13668, 13777, 13787, 15312,
         15315, 15687, 15692, 16084, 16088, 16163, 16382, 16469, 16533, 17011,
         17050, 17214, 17474, 17483, 17852, 17879, 17887, 17889, 17891, 17904,
         17923, 17932, 17938, 17946, 17961, 18066, 18112, 18522, 18542, 18919,
         18960, 19049, 19155, 19388, 19396
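
The padding step of the collator can be shown in isolation. This is a plain-Python sketch of right-padding with zeros, assuming toy id lists; the notebook's `collator` does the same with `torch.nn.functional.pad` and stacks the rows into one tensor. The name `pad_batch` is hypothetical.

```python
def pad_batch(list_of_id_lists, pad_id=0):
    """Right-pad every id list with pad_id up to the longest list in the batch."""
    max_len = max(len(ids) for ids in list_of_id_lists)
    return [list(ids) + [pad_id] * (max_len - len(ids)) for ids in list_of_id_lists]

toy_batch = [[5, 9, 12], [7], [3, 4]]
padded = pad_batch(toy_batch)
for row in padded:
    print(row)
```

Padding to the longest example in each batch, rather than to a global maximum, keeps the batches as small as possible.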

---

## 3. Machine learning model

### 3.1. Model training

In [13]:
#Build the MLP model
import torch
import transformers

# A model wants a config, I can simply inherit from the base
# class for pretrained configs
class MLPConfig(transformers.PretrainedConfig):
    pass

# This is the model
class MLP(transformers.PreTrainedModel):

    config_class=MLPConfig

    # In the initialization method, one instantiates the layers
    # these will be, for the most part the trained parameters of the model
    def __init__(self,config):
        super().__init__(config)
        self.vocab_size=config.vocab_size #embedding matrix row count
        # Build and initialize embedding of vocab size +1 x hidden size (+1 because of the padding index 0!)
        self.embedding=torch.nn.Embedding(num_embeddings=self.vocab_size+1,embedding_dim=config.hidden_size,padding_idx=0)
        # Normally you would not initialize these yourself, but I have my reasons here ;)
        torch.nn.init.uniform_(self.embedding.weight.data,-0.001,0.001) #initialize the embeddings with small random values
        # Note! uniform_ overwrites every row, including the padding row 0, so zero it again;
        # padding_idx=0 then keeps it at pure zeros because that row receives no gradient
        self.embedding.weight.data[0].zero_()
        # This takes care of the lower half of the network, now the upper half
        # Output layer: hidden size x output size
        self.output=torch.nn.Linear(in_features=config.hidden_size,out_features=config.nlabels)
        # Now we have the parameters of the model

    def forward(self,input_ids,labels=None): #labels are optional: without them only the logits are returned
        #sum up the embeddings of the items
        embedded=self.embedding(input_ids) #(batch,ids)->(batch,ids,embedding_dim)
        # Since the Embedding keeps the first row of the matrix pure zeros, we don't need to worry about the padding
        # so next we sum the embeddings across the word dimension
        # (batch,ids,embedding_dim) -> (batch,embedding_dim)
        embedded_summed=torch.sum(embedded,dim=1)
        logits=self.output(embedded_summed)

        # We have labels, so we ought to calculate the loss
        if labels is not None:
            loss=torch.nn.CrossEntropyLoss() #This loss is meant for classification, so let's use it
            # You run it as loss(model_output,correct_labels)
            return (loss(logits,labels),logits)
        else:
            # No labels, so just return the logits
            return (logits,)
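
The forward pass sums embedding rows, so padding is harmless only as long as row 0 of the embedding matrix is all zeros. A minimal plain-Python sketch with hypothetical toy parameters (not the trained weights, and not PyTorch code) illustrates this:

```python
def forward(input_ids, embedding, weight, bias):
    """Sum the embedding rows of the given ids, then apply one linear layer."""
    hidden_size = len(embedding[0])
    summed = [sum(embedding[i][d] for i in input_ids) for d in range(hidden_size)]
    return [sum(w * h for w, h in zip(row, summed)) + b for row, b in zip(weight, bias)]

# Hypothetical toy parameters: 3 vocabulary items plus the padding row 0.
embedding = [[0.0], [0.5], [-1.0], [2.0]]  # row 0 is all zeros (padding)
weight = [[1.0], [-1.0]]                   # shape (nlabels=2, hidden_size=1)
bias = [0.1, -0.1]

unpadded = forward([1, 3], embedding, weight, bias)
padded = forward([1, 3, 0, 0], embedding, weight, bias)
print(unpadded, padded)
```

With `hidden_size=1`, as configured in the next cells, each vocabulary item contributes a single scalar to the sum, so the model behaves much like a linear bag-of-words classifier.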

In [14]:
# Configure the model:
mlp_config=MLPConfig(vocab_size=len(vectorizer.vocabulary_),hidden_size=1,nlabels=2)

In [15]:
#Make a model
mlp=MLP(mlp_config)
fake_batch=collator([dset_tokenized["train"][0],dset_tokenized["train"][1]])
mlp(**fake_batch) #** expands input_ids and labels as parameters of the call

(tensor(0.2942, grad_fn=<NllLossBackward0>),
 tensor([[-0.4456,  0.6216],
         [-0.4556,  0.6226]], grad_fn=<AddmmBackward0>))
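
The loss printed above can be checked by hand from the two rows of logits. The sketch below computes the mean cross-entropy in plain Python. It assumes that both examples in the batch carry label 1; the first example was printed earlier with label 1, while the label of the second is an assumption.

```python
import math

def cross_entropy(logits, label):
    """Negative log-softmax probability of the correct label."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Logits printed by the cell above for the two-example batch.
batch_logits = [[-0.4456, 0.6216], [-0.4556, 0.6226]]
# Assumption: both examples are labelled 1 (positive).
batch_labels = [1, 1]

losses = [cross_entropy(z, y) for z, y in zip(batch_logits, batch_labels)]
mean_loss = sum(losses) / len(losses)
print(round(mean_loss, 4))
```

Under that assumption the result agrees with the printed loss to the precision shown, which is consistent with `CrossEntropyLoss` averaging over the batch by default.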

In [16]:
# Set training arguments
# their names are mostly self-explanatory
trainer_args = transformers.TrainingArguments(
    "mlp_checkpoints", #save checkpoints here
    evaluation_strategy="steps",
    logging_strategy="steps",
    eval_steps=500,
    logging_steps=500,
    learning_rate=1e-4, #learning rate of the gradient descent
    max_steps=20000,
    load_best_model_at_end=True,
    per_device_train_batch_size=128
)

pprint(trainer_args)

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_steps=500,
eval_strategy=steps,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp



In [17]:
# A metric for evaluating performance
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = np.argmax(outputs, axis=-1) #pick the index of the "winning" label
    return accuracy.compute(predictions=predictions, references=labels)

In [18]:
# Make a new model
mlp = MLP(mlp_config)


# Argument gives the number of steps of patience before early stopping
# i.e. training is stopped when the evaluation loss fails to improve
# certain number of times
early_stopping = transformers.EarlyStoppingCallback(5)

trainer = transformers.Trainer(
    model=mlp,
    args=trainer_args,
    train_dataset=dset_tokenized["train"],
    eval_dataset=dset_tokenized["test"].select(range(1000)), #make a smaller subset to evaluate on
    compute_metrics=compute_accuracy,
    data_collator=collator,
    callbacks=[early_stopping]
)

# FINALLY!
trainer.train()

max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.8132,0.717333,0.526
1000,0.6688,0.641352,0.599
1500,0.5921,0.570332,0.725
2000,0.519,0.505845,0.795
2500,0.4569,0.455137,0.832
3000,0.4081,0.415796,0.851
3500,0.3682,0.384078,0.866
4000,0.3376,0.358987,0.878
4500,0.3113,0.338851,0.887
5000,0.2885,0.322087,0.889


TrainOutput(global_step=20000, training_loss=0.250186128616333, metrics={'train_runtime': 465.4908, 'train_samples_per_second': 5499.572, 'train_steps_per_second': 42.965, 'total_flos': 26150023872.0, 'train_loss': 0.250186128616333, 'epoch': 102.04081632653062})
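
The `epoch` value in the output above follows from the step budget and the batch size. A short sketch of the arithmetic, assuming a single device and no gradient accumulation so that one step consumes one batch:

```python
import math

train_examples = 25_000  # size of the IMDB training split
batch_size = 128         # per_device_train_batch_size
max_steps = 20_000       # max_steps in TrainingArguments

# Assumption: one device and no gradient accumulation, so one step = one batch.
steps_per_epoch = math.ceil(train_examples / batch_size)
epochs = max_steps / steps_per_epoch
print(steps_per_epoch, round(epochs, 2))
```

Since `global_step` equals `max_steps`, this run used its full step budget, which amounts to roughly 102 passes over the training data.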

In [19]:
#Evaluation
eval_results = trainer.evaluate(dset_tokenized["test"])

print(eval_results)

{'eval_loss': 0.2793724834918976, 'eval_accuracy': 0.8888, 'eval_runtime': 6.7773, 'eval_samples_per_second': 3688.778, 'eval_steps_per_second': 461.097, 'epoch': 102.04081632653062}


### 3.2 Hyperparameter optimization

### 3.3. Evaluation on test set

In [20]:
import optuna

def objective(trial):
    # Define the search space for hyperparameters
    learning_rate = trial.suggest_float("learning_rate", 1e-6, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 64, 128, 256])

    trainer_args = transformers.TrainingArguments(
        "mlp_checkpoints", #save checkpoints here
        evaluation_strategy="steps",
        logging_strategy="steps",
        eval_steps=500,
        logging_steps=500,
        learning_rate=learning_rate, #learning rate of the gradient descent
        max_steps=10000, #I will keep this small, no time to wait too long :)
        load_best_model_at_end=True,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size
    )

    mlp = MLP(mlp_config)

    trainer = transformers.Trainer(
        model=mlp,
        args=trainer_args,
        train_dataset=dset_tokenized["train"],
        eval_dataset=dset_tokenized["test"].select(range(1000)), #make a smaller subset to evaluate on
        compute_metrics=compute_accuracy,
        data_collator=collator,
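        callbacks=[transformers.EarlyStoppingCallback(5)] #fresh callback per trial, so no patience state carries over
    )

    # Train the model, then evaluate the best checkpoint (lowest validation loss)
    trainer.train()
    eval_results = trainer.evaluate()
    return eval_results["eval_accuracy"] #let's try to maximize accuracy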

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=7) #I will keep this small, no time to wait :)

[I 2024-06-16 12:20:16,173] A new study created in memory with name: no-name-4145ee28-93f0-4de4-8ba8-7bb840e0384f
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.8063,0.798308,0.497
1000,0.7861,0.783824,0.497
1500,0.7776,0.771153,0.497
2000,0.7626,0.760345,0.497
2500,0.7534,0.751081,0.497
3000,0.7435,0.743127,0.498
3500,0.7387,0.736343,0.507
4000,0.7309,0.730608,0.506
4500,0.7273,0.725687,0.51
5000,0.7212,0.721477,0.514


[I 2024-06-16 12:22:45,682] Trial 0 finished with value: 0.516 and parameters: {'learning_rate': 1.3848899625564592e-06, 'batch_size': 64}. Best is trial 0 with value: 0.516.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.88,0.856887,0.503
1000,0.8397,0.824572,0.503
1500,0.8104,0.798325,0.503
2000,0.7836,0.777777,0.506
2500,0.7642,0.761898,0.504
3000,0.7492,0.749602,0.514
3500,0.7376,0.740105,0.521
4000,0.7285,0.732589,0.518
4500,0.7209,0.726526,0.519
5000,0.7145,0.721557,0.522


[I 2024-06-16 12:26:22,932] Trial 1 finished with value: 0.524 and parameters: {'learning_rate': 7.955752390193923e-06, 'batch_size': 128}. Best is trial 1 with value: 0.524.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.2059,0.410069,0.876
1000,0.0699,0.686511,0.853
1500,0.0362,1.019623,0.857
2000,0.0231,1.254269,0.856
2500,0.0152,1.448544,0.842
3000,0.0111,1.83465,0.839


[I 2024-06-16 12:27:20,661] Trial 2 finished with value: 0.876 and parameters: {'learning_rate': 0.03140094734688777, 'batch_size': 128}. Best is trial 2 with value: 0.876.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.1909,0.315814,0.886
1000,0.0267,0.479184,0.873
1500,0.0061,0.599777,0.871
2000,0.0025,0.683672,0.865
2500,0.0013,0.75385,0.863
3000,0.0008,0.808089,0.86


[I 2024-06-16 12:29:22,911] Trial 3 finished with value: 0.886 and parameters: {'learning_rate': 0.005331893132067162, 'batch_size': 256}. Best is trial 3 with value: 0.886.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.3603,0.308175,0.881
1000,0.2954,0.292662,0.876
1500,0.2987,0.271352,0.884
2000,0.1539,0.297651,0.876
2500,0.1564,0.30764,0.879
3000,0.1851,0.295927,0.876
3500,0.1079,0.411106,0.872
4000,0.0947,0.343268,0.873


[I 2024-06-16 12:29:46,092] Trial 4 finished with value: 0.884 and parameters: {'learning_rate': 0.0194354764365268, 'batch_size': 16}. Best is trial 3 with value: 0.886.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.4605,0.312727,0.895
1000,0.2233,0.252481,0.905
1500,0.1561,0.24367,0.899
2000,0.1176,0.250147,0.9
2500,0.0908,0.264752,0.902
3000,0.071,0.280496,0.896
3500,0.0566,0.300205,0.888
4000,0.0446,0.322884,0.885


[I 2024-06-16 12:31:05,594] Trial 5 finished with value: 0.899 and parameters: {'learning_rate': 0.000828301723840474, 'batch_size': 128}. Best is trial 5 with value: 0.899.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.1756,0.363405,0.884
1000,0.0173,0.563929,0.872
1500,0.0032,0.686361,0.874
2000,0.0014,0.765112,0.872
2500,0.0008,0.831607,0.871
3000,0.0005,0.887623,0.866


[I 2024-06-16 12:32:52,725] Trial 6 finished with value: 0.884 and parameters: {'learning_rate': 0.007734200830863314, 'batch_size': 256}. Best is trial 5 with value: 0.899.
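
Several trials above stop well before `max_steps`, which matches an early stopping patience of 5 evaluations. The sketch below is a simplified model of that rule, not the `transformers` implementation; the function name is hypothetical, and the losses are the validation losses of trial 5 copied from the log.

```python
def early_stopping_step(eval_losses, eval_every, patience):
    """Return (stop_step, best_step) for a sequence of evaluation losses.

    Training stops once `patience` evaluations in a row fail to improve on
    the best loss seen so far; stop_step is None if that never happens.
    """
    best_loss = float("inf")
    best_step = None
    bad_evals = 0
    for i, loss in enumerate(eval_losses, start=1):
        step = i * eval_every
        if loss < best_loss:
            best_loss, best_step, bad_evals = loss, step, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return step, best_step
    return None, best_step

# Validation losses of trial 5, copied from the log above.
trial5_losses = [0.312727, 0.252481, 0.24367, 0.250147,
                 0.264752, 0.280496, 0.300205, 0.322884]
stop_step, best_step = early_stopping_step(trial5_losses, eval_every=500, patience=5)
print(stop_step, best_step)
```

The best step found this way is also where the trial's reported value comes from: with `load_best_model_at_end=True`, the checkpoint with the lowest validation loss is reloaded, so trial 5 reports the accuracy at step 1500 (0.899) rather than its highest logged accuracy (0.905).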


In [21]:
# Print the best trial's hyperparameters and objective value
best_trial = study.best_trial
print(f"Best trial (number {best_trial.number}):")
print(f"  Value: {best_trial.value}")
print(f"  Params: {best_trial.params}")

# Print all trials' hyperparameters and objective values
print("\nAll trials:")
for trial in study.trials:
    print(f"  Trial {trial.number}:")
    print(f"    Value: {trial.value}")
    print(f"    Params: {trial.params}")

Best trial (number 5):
  Value: 0.899
  Params: {'learning_rate': 0.000828301723840474, 'batch_size': 128}

All trials:
  Trial 0:
    Value: 0.516
    Params: {'learning_rate': 1.3848899625564592e-06, 'batch_size': 64}
  Trial 1:
    Value: 0.524
    Params: {'learning_rate': 7.955752390193923e-06, 'batch_size': 128}
  Trial 2:
    Value: 0.876
    Params: {'learning_rate': 0.03140094734688777, 'batch_size': 128}
  Trial 3:
    Value: 0.886
    Params: {'learning_rate': 0.005331893132067162, 'batch_size': 256}
  Trial 4:
    Value: 0.884
    Params: {'learning_rate': 0.0194354764365268, 'batch_size': 16}
  Trial 5:
    Value: 0.899
    Params: {'learning_rate': 0.000828301723840474, 'batch_size': 128}
  Trial 6:
    Value: 0.884
    Params: {'learning_rate': 0.007734200830863314, 'batch_size': 256}
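
The learning rates listed above span several orders of magnitude because `suggest_float(..., log=True)` samples in the log domain. A standard-library sketch of log-uniform sampling over the same range follows; it illustrates the distribution only and is not Optuna's sampler. The function name and the seed are assumptions.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)  # fixed seed, only to make the sketch repeatable
samples = [sample_log_uniform(1e-6, 1e-1, rng) for _ in range(10_000)]

# Each of the five decades should receive roughly one fifth of the samples.
for exponent in range(-6, -1):
    lo, hi = 10.0 ** exponent, 10.0 ** (exponent + 1)
    share = sum(lo <= s < hi for s in samples) / len(samples)
    print(f"[{lo:.0e}, {hi:.0e}): {share:.2f}")
```

With a range of 1e-6 to 1e-1, about two fifths of the draws fall below 1e-4, where this model learns very slowly, so narrowing the range would spend the small trial budget more efficiently.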


In [22]:
import optuna.visualization as vis
fig = vis.plot_parallel_coordinate(study)
fig.show()


In [23]:
fig = vis.plot_slice(study)
fig.show()

In [24]:
fig = vis.plot_contour(study, params=['learning_rate', 'batch_size'])
fig.show()

In [25]:
fig = vis.plot_optimization_history(study)
fig.show()

---

## 4. Results and summary

### 4.1 Corpus insights

The Large Movie Review Dataset is a corpus for binary sentiment classification that contains substantially more data than earlier benchmark datasets. It provides 25,000 highly polar movie reviews for training and 25,000 for testing. The model learns from the training set to predict whether a review is positive or negative and is then evaluated on the test set.
Analysing reviews automatically is important because they are far too numerous to be handled by humans. Automatic evaluation makes it possible to rate each item from its comments and opinions in real time. Because language changes over time, models should be updated and scaled as more comments, new words and expressions, and better technology become available.

### 4.2 Results

My model is based on exercise 6 of the course, which simplified the programming and helped avoid mistakes. The best Optuna trial reached 89.9% accuracy on the 1,000-example evaluation subset ('learning_rate': 0.000828301723840474, 'batch_size': 128), and the initial configuration (learning rate 1e-4, batch size 128) reached 88.9% on the full test set. For comparison, https://doi.org/10.32604/jai.2023.045617 reports 93.74% accuracy on the IMDb dataset using Extra-Long Neural Network (XLNet).
Among the seven trials, the highest accuracy came from the lowest learning rate that still converged (about 8e-4, batch size 128). Learning rates below 1e-5 stayed near chance level, larger learning rates reached 87.6% to 88.6%, and batch size showed no clear effect.

In another study from 2020, 91% was reported for sentiment analysis on IMDB using lexicons and neural networks (https://doi.org/10.1007/s42452-019-1926-x). Almost the same value was reported in a 2024 publication, "What's Next?: Exploring Machine Learning-Based Approaches to Content Suggestions Using IMDb Movie Reviews" (https://doi.org/10.1109/ICEIC61013.2024.10457130). The highest figure found, 93.74%, is reported in "Opinion Mining on Movie Reviews Based on Deep Learning Models" (https://doi.org/10.32604/jai.2023.045617).

### 4.3 Relation to state of the art

As shown above, my result is reasonably close to the state of the art, about five percentage points below the best reported figure. However, reaching 93% accuracy is difficult: it was achieved with XLNet, which requires a large amount of computation and training time. XLNet combines the advantages of autoregressive and autoencoding methods through a technique called permutation language modeling, and its architecture builds on Transformer-XL together with a carefully crafted two-stream attention mechanism. This cost has been noted: "Although XLNet performed better in many ways, its longer training time may be an issue in circumstances when we need faster results but have limited resources or time to train the model".
My model therefore offers acceptable accuracy for general use whenever the time and computing resources needed for such larger models are unavailable or not justified.

---

## 5. Bonus Task (optional)

### 5.1. Annotating out-of-domain documents

(Briefly describe the chosen out-of-domain documents)

(Briefly describe the process of annotation)

### 5.2 Conversion into dataset

In [26]:
# Your code to convert the annotations into a dataset here

### 5.3. Model evaluation on out-of-domain test set

In [27]:
# Your code to evaluate the model on the out-of-domain test set here

### 5.4 Bonus task results

(Present the results of the evaluation on the out-of-domain test set)

### 5.5. Annotated data

In [28]:
# Include your annotated out-of-domain data here