
# Introduction to HLT Project (Template)

- Student(s) Name(s): Mohammadreza Akhtari
- Date: June 2024
- Chosen Corpus: imdb
- Contributions (if group project): -

### Corpus information

- Description of the chosen corpus: The Large Movie Review Dataset, a dataset for binary sentiment classification. It contains 25,000 highly polar movie reviews for training and 25,000 for testing.
- Paper(s) and other published materials related to the corpus: Maas, Andrew, et al. "Learning word vectors for sentiment analysis." Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies. 2011.
- State-of-the-art performance (best published results) on this corpus: An F1-score of 77.66% and an accuracy of 77.81% for predicting positive or negative sentiment, as reported in the 2023 publication "Opinion classification for IMDb review based using naive bayes method" (https://doi.org/10.1063/5.0171628).

---

## 1. Setup

In [39]:
# Your code to install and import libraries etc. here
!pip3 install -q transformers[torch] datasets evaluate optuna plotly

---

## 2. Data download and preprocessing

### 2.1. Download the corpus

In [41]:
# Your code to download the corpus here
from pprint import pprint #pprint
import datasets
dset1=datasets.load_dataset("imdb")
pprint(dset1)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


In [42]:
dset=dset1.shuffle() #This is never a bad idea, datasets may have ordering to them, which is not what we want
del dset["unsupervised"] #Delete the unlabeled part of the dataset, we don't need it for anything

In [43]:
pprint(dset['train'][0]['text'])
print(dset['train'][0]['label'])

('Anna (Charlotte Burke), who is just on the verge of puberty, begins to have '
 'strange dreams which start affecting her in real life--especially involving '
 'a boy named Mark (Elliott Spiers) who she meets in her dreams.<br /><br '
 '/>Very unusual fantasy with some truly terrifying moments. Despite the fact '
 'that this is about a teenage girl and has a PG-13 rating, this is NOT for '
 "children. Also, if you hate fantasies stay far away. But if you're game for "
 'something different this fits the bill.<br /><br />Well directed by Bernard '
 'Rose with a just beautiful music score and a few nice, scary jolts. The only '
 "thing that prevents this from being a really great movie is Burke--she's not "
 "a very good actress (it's no surprise that this has been her only film) and "
 'it hurts the movie. However, everybody else is just great.<br /><br />Spiers '
 'is very good as Mark; Glenne Headley (faking a British accent very well) is '
 "also very good as Anna's mother and Ben C

### 2.2. Preprocessing

In [44]:
#Tokenize and map vocabulary
import sklearn.feature_extraction

vectorizer=sklearn.feature_extraction.text.CountVectorizer(binary=True,max_features=20000)

texts=[ex["text"] for ex in dset["train"]] #get a list of all texts from the training data
vectorizer.fit(texts) #"Trains" the vectorizer, i.e. builds its vocabulary
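
To make the idea of a capped vocabulary concrete, here is a minimal pure-Python sketch of what "fitting" a binary bag-of-words vectorizer amounts to. It is a simplified approximation and not scikit-learn's implementation: the toy corpus, the function name `build_vocabulary`, the ranking by document frequency and the alphabetical tie-breaking are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical toy corpus, not taken from IMDb
toy_texts = [
    "A great movie with a great cast",
    "A dull movie",
    "Great music, dull plot",
]

def build_vocabulary(texts, max_features):
    """Map the most widespread tokens to integer indices (simplified sketch)."""
    doc_freq = Counter()
    for text in texts:
        # Lowercase and keep tokens of two or more word characters
        tokens = set(re.findall(r"\b\w\w+\b", text.lower()))
        doc_freq.update(tokens)  # each token counts once per document
    # Rank by document frequency, break ties alphabetically (an assumption)
    ranked = sorted(doc_freq, key=lambda tok: (-doc_freq[tok], tok))
    kept = sorted(ranked[:max_features])  # indices follow alphabetical order
    return {tok: idx for idx, tok in enumerate(kept)}

toy_vocab = build_vocabulary(toy_texts, max_features=3)
print(toy_vocab)
```

With `max_features=3`, only the three tokens that occur in two documents each survive; every other token is dropped and will be invisible to the model.
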

In [45]:
#Building the feature vectors
def vectorize_example(ex):
    vectorized=vectorizer.transform([ex["text"]]) # [...] because the vectorizer expects a list/iterable over inputs, not one input
    non_zero_features=vectorized.nonzero()[1] #.nonzero gives a pair of (rows,columns), we want the columns
    non_zero_features+=1 #feature index 0 will have a special meaning
                         # so let us not produce it by adding +1 to everything
    return {"input_ids":non_zero_features}

vectorized=vectorize_example(dset["train"][0])

In [46]:
print(vectorized)

{'input_ids': array([   23,   122,   309,   342,   431,   566,   774,   887,   937,
        1207,  1456,  1517,  1732,  1756,  1778,  1793,  1829,  1865,
        1942,  2248,  2285,  2295,  2415,  2564,  2604,  2625,  3076,
        3174,  4345,  4971,  5104,  5161,  5559,  5934,  5942,  6245,
        6316,  6584,  6594,  6621,  6658,  6662,  6663,  6706,  6825,
        6878,  6952,  7127,  7336,  7346,  7472,  7682,  7801,  7932,
        8289,  8303,  8322,  8476,  8602,  8783,  8857,  8929,  9085,
        9560,  9602,  9630,  9804,  9890, 10449, 11008, 11228, 11605,
       11725, 11762, 11832, 11900, 12079, 12134, 12202, 12363, 12437,
       12445, 13139, 13721, 13980, 14282, 14310, 14332, 14349, 14583,
       15188, 15496, 15557, 15919, 16161, 16312, 16542, 16550, 16921,
       16951, 17092, 17443, 17574, 17766, 17848, 17893, 17897, 17951,
       17968, 18115, 18166, 18468, 18908, 19088, 19112, 19509, 19549,
       19559, 19590, 19712, 19927], dtype=int32)}


In [47]:
# We can map back to vocabulary and check that everything works
# vectorizer.vocabulary_ is a dictionary {key:word, value:idx}

idx2word=dict((i,w) for (w,i) in vectorizer.vocabulary_.items()) #inverse the vocab dictionary
words=[]
for idx in vectorized["input_ids"]:
    words.append(idx2word[idx-1]) ## It is easy to forget we moved all by +1
pprint(", ".join(words)) #This is now the bag of words representation of the document

('13, 1988, about, accent, actress, affecting, also, and, anna, as, away, bad, '
 'beautiful, been, begins, being, ben, bernard, bill, both, boy, br, british, '
 'burke, but, by, charlotte, children, cross, despite, different, directed, '
 'dreams, elliott, else, especially, everybody, fact, faded, faking, '
 'fantasies, fantasy, far, father, few, film, fits, for, frightening, from, '
 'game, girl, good, great, has, hate, have, her, hit, however, hurts, if, in, '
 'involving, is, it, jolts, just, life, mark, meets, moments, mother, movie, '
 'music, named, nice, no, not, of, on, only, pg, prevents, puberty, rating, '
 're, real, really, released, rose, scary, score, she, since, sleeper, some, '
 'something, start, stay, strange, surprise, sympathetic, teenage, terrifying, '
 'that, the, thing, this, to, too, truly, unusual, verge, very, well, when, '
 'which, who, with, you')
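
The +1 shift and its inverse can be illustrated on a toy vocabulary. This is only a sketch: `toy_vocab`, `encode`, `decode` and `PAD_ID` are hypothetical names that do not appear in the notebook, and the vocabulary is made up.

```python
# Hypothetical toy vocabulary (word -> column index), for illustration only
toy_vocab = {"dull": 0, "great": 1, "movie": 2}
PAD_ID = 0  # reserved for padding, so real features start at 1

def encode(words, vocab):
    """Return sorted, shifted ids of the distinct in-vocabulary words."""
    return sorted({vocab[w] + 1 for w in words if w in vocab})

def decode(ids, vocab):
    """Undo the +1 shift and look each id up in the inverted vocabulary."""
    idx2word = {idx: word for word, idx in vocab.items()}
    return [idx2word[i - 1] for i in ids if i != PAD_ID]

ids = encode(["great", "movie", "great", "unknown"], toy_vocab)
words = decode(ids + [PAD_ID, PAD_ID], toy_vocab)
print(ids, words)
```

Repeated words collapse into one id and out-of-vocabulary words vanish, which is why the decoded output above is an alphabetical set of words rather than the original review.
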


In [48]:
# Apply the tokenizer to the whole dataset using .map()
dset_tokenized = dset.map(vectorize_example,num_proc=4)
pprint(dset_tokenized["train"][0])


{'input_ids': [23,
               122,
               309,
               342,
               431,
               566,
               774,
               887,
               937,
               1207,
               1456,
               1517,
               1732,
               1756,
               1778,
               1793,
               1829,
               1865,
               1942,
               2248,
               2285,
               2295,
               2415,
               2564,
               2604,
               2625,
               3076,
               3174,
               4345,
               4971,
               5104,
               5161,
               5559,
               5934,
               5942,
               6245,
               6316,
               6584,
               6594,
               6621,
               6658,
               6662,
               6663,
               6706,
               6825,
               6878,
               6952,
               7127,
  

In [49]:
#Padding and Collation (forming a batch)
import torch

def collator(list_of_examples):
    batch={"labels":torch.tensor(list(ex["label"] for ex in list_of_examples))} #this is easy, labels are made into a single tensor
    #the worse bit is now to pad the examples, as they are of different length
    tensors=[]
    max_len=max(len(example["input_ids"]) for example in list_of_examples) #this is the longest example in the batch
    #everything needs to be padded to fit in length the longest example
    #(so we can build a single tensor out of it)
    for example in list_of_examples:
        ids=torch.tensor(example["input_ids"]) #pick the input ids
        # pad(what,(from_left, from_right)) <- this is how we call the stock pad function
        padded=torch.nn.functional.pad(ids,(0,max_len-ids.shape[0])) #pad by max - current length, pads with zero by default
        tensors.append(padded) #accumulated the padded ids
    batch["input_ids"]=torch.vstack(tensors) #now that we have all of them the same length, a simple vstack() stacks them up
    return batch #...and that's all there is to it
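
The padding step on its own can be sketched with plain Python lists, without tensors. The function name `pad_batch` and the id lists below are hypothetical and only serve to show the shape logic.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every id list with pad_id up to the longest list in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Hypothetical id lists of different lengths
toy_batch = pad_batch([[5, 9, 12], [7], [3, 4]])
for row in toy_batch:
    print(row)
```

Because the target length is the longest example in the batch, the width of `input_ids` changes from batch to batch.
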

In [51]:
#Build a batch from 2 examples, with padding
batch=collator([dset_tokenized["train"][2],dset_tokenized["train"][7]])
print("Shape of labels:",batch["labels"].shape)
print("Shape of input_ids:",batch["input_ids"].shape)
pprint(batch["labels"])
pprint(batch["input_ids"])

Shape of labels: torch.Size([2])
Shape of input_ids: torch.Size([2, 183])
tensor([0, 0])
tensor([[  151,   186,   198,   271,   309,   396,   419,   431,   594,   609,
           670,   727,   860,   887,  1001,  1157,  1173,  1207,  1222,  1300,
          1702,  1737,  1807,  2232,  2292,  2604,  2625,  2691,  2864,  2878,
          2880,  3386,  3591,  3775,  3788,  3894,  4647,  4965,  5028,  5113,
          5196,  5372,  5490,  5519,  5559,  6044,  6245,  6302,  6311,  6315,
          6706,  6782,  6878,  7066,  7067,  7127,  7134,  7346,  7695,  7777,
          7801,  8206,  8224,  8241,  8289,  8322,  8350,  8476,  8504,  8567,
          8592,  8657,  8714,  8942,  9052,  9085,  9162,  9485,  9602,  9626,
          9630,  9890,  9987, 10000, 10097, 10248, 10308, 10398, 10623, 10642,
         10644, 10871, 11176, 11181, 11267, 11460, 11721, 11762, 11778, 11857,
         12055, 12202, 12212, 12363, 12364, 12427, 12437, 12445, 12564, 12887,
         12891, 13051, 13323, 13325, 13624

---

## 3. Machine learning model

### 3.1. Model training

In [53]:
#Build the MLP model
import torch
import transformers

# A model wants a config, I can simply inherit from the base
# class for pretrained configs
class MLPConfig(transformers.PretrainedConfig):
    pass

# This is the model
class MLP(transformers.PreTrainedModel):

    config_class=MLPConfig

    # In the initialization method, one instantiates the layers
    # these will be, for the most part the trained parameters of the model
    def __init__(self,config):
        super().__init__(config)
        self.vocab_size=config.vocab_size #embedding matrix row count
        # Build and initialize embedding of vocab size +1 x hidden size (+1 because of the padding index 0!)
        self.embedding=torch.nn.Embedding(num_embeddings=self.vocab_size+1,embedding_dim=config.hidden_size,padding_idx=0)
        # Normally you would not initialize these yourself, but I have my reasons here ;)
        torch.nn.init.uniform_(self.embedding.weight.data,-0.001,0.001) #initialize the embeddings with small random values
        # Note! uniform_ also overwrites row 0, so zero the padding row again;
        # padding_idx=0 then keeps it at zero because that row receives no gradient
        self.embedding.weight.data[0].zero_()
        # This takes care of the lower half of the network, now the upper half
        # Output layer: hidden size x output size
        self.output=torch.nn.Linear(in_features=config.hidden_size,out_features=config.nlabels)
        # Now we have the parameters of the model


    # The computation of the model is put into the forward() function
    # it receives a batch of data and optionally the correct `labels`
    #
    # If given `labels` it returns (loss,output)
    # if not, then it returns (output,)
    def forward(self,input_ids,labels=None): #the custom collator supplies exactly these two keys, so no attention_mask is needed
        #1) sum up the embeddings of the items
        embedded=self.embedding(input_ids) #(batch,ids)->(batch,ids,embedding_dim)
        # Since the Embedding keeps the first row of the matrix pure zeros, we don't need to worry about the padding
        # so next we sum the embeddings across the word dimension
        # (batch,ids,embedding_dim) -> (batch,embedding_dim)
        embedded_summed=torch.sum(embedded,dim=1)

        #2) apply non-linearity
        # (batch,embedding_dim) -> (batch,embedding_dim)

        #### MODIFIED HERE FOR EXERCISE 5 -> commented out
        ####projected=torch.tanh(embedded_summed) #Note how non-linearity is applied here and not when configuring the layer in __init__()

        #3) and now apply the upper, output layer of the network
        # (batch,embedding_dim) -> (batch, num_of_classes i.e. 2 in our case)

        #### MODIFIED HERE FOR EXERCISE 5 -> base it off embedded_summed
        ##### OLD: logits=self.output(projected)
        logits=self.output(embedded_summed)

        # ...and that's all there is to it!

        #print("input_ids.shape",input_ids.shape)
        #print("embedded.shape",embedded.shape)
        #print("embedded_summed.shape",embedded_summed.shape)
        #print("projected.shape",projected.shape)
        #print("logits.shape",logits.shape)

        # We have labels, so we ought to calculate the loss
        if labels is not None:
            loss=torch.nn.CrossEntropyLoss() #This loss is meant for classification, so let's use it
            # You run it as loss(model_output,correct_labels)
            return (loss(logits,labels),logits)
        else:
            # No labels, so just return the logits
            return (logits,)

In [54]:
# Configure the model:
mlp_config=MLPConfig(vocab_size=len(vectorizer.vocabulary_),hidden_size=1,nlabels=2)

In [55]:
#Make a model
mlp=MLP(mlp_config)
fake_batch=collator([dset_tokenized["train"][0],dset_tokenized["train"][1]])
mlp(**fake_batch) #** expands input_ids and labels as parameters of the call

(tensor(0.7118, grad_fn=<NllLossBackward0>),
 tensor([[ 0.2545, -0.1000],
         [ 0.2526, -0.0871]], grad_fn=<AddmmBackward0>))
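
A loss of about 0.71 on an untrained model is close to ln 2 ≈ 0.693, which is what cross-entropy gives when a two-class model is undecided. The sketch below reproduces the forward pass with plain Python numbers for a hidden size of 1. The parameter values are hypothetical, and the functions are simplified stand-ins, not the PyTorch implementation.

```python
import math

def forward(input_ids, embedding, weights, biases):
    """Sum 1-d embeddings of the ids, then apply a linear layer with 2 outputs."""
    summed = sum(embedding[i] for i in input_ids)  # id 0 is padding and adds 0.0
    return [w * summed + b for w, b in zip(weights, biases)]

def cross_entropy(logits, label):
    """Negative log of the softmax probability assigned to the correct label."""
    log_norm = math.log(sum(math.exp(z) for z in logits))
    return log_norm - logits[label]

# Hypothetical parameters: 4 embedding rows (row 0 = padding), hidden size 1
embedding = [0.0, 0.001, -0.0005, 0.0008]
logits = forward([1, 3, 0, 0], embedding, weights=[0.5, -0.5], biases=[0.0, 0.0])
loss = cross_entropy(logits, label=1)
print(logits, loss, math.log(2))
```

With tiny embeddings the summed input is near zero, so the logits are dominated by the output layer's bias and the loss starts near ln 2.
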

In [56]:
# Set training arguments
# their names are mostly self-explanatory
trainer_args = transformers.TrainingArguments(
    "mlp_checkpoints", #save checkpoints here
    evaluation_strategy="steps",
    logging_strategy="steps",
    eval_steps=500,
    logging_steps=500,
    learning_rate=1e-4, #learning rate of the gradient descent
    max_steps=20000,
    load_best_model_at_end=True,
    per_device_train_batch_size=128
)

pprint(trainer_args)

TrainingArguments(
_n_gpu=0,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_steps=500,
eval_strategy=steps,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp



In [57]:
# A metric for evaluating performance
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = np.argmax(outputs, axis=-1) #pick the index of the "winning" label
    return accuracy.compute(predictions=predictions, references=labels)
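
What the metric computes can be sketched without NumPy or the `evaluate` library. The function name `argmax_accuracy` and the toy logits are hypothetical.

```python
def argmax_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [row.index(max(row)) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical logits for four examples and two classes
toy_logits = [[0.2, -0.1], [-1.0, 0.3], [0.5, 0.4], [-0.2, 0.9]]
toy_labels = [0, 1, 1, 1]
print(argmax_accuracy(toy_logits, toy_labels))
```

The third example is predicted as class 0 but labelled 1, so three of the four predictions are correct.
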

In [58]:
# Make a new model
mlp = MLP(mlp_config)


# Argument gives the number of steps of patience before early stopping
# i.e. training is stopped when the evaluation loss fails to improve
# certain number of times
early_stopping = transformers.EarlyStoppingCallback(5)

trainer = transformers.Trainer(
    model=mlp,
    args=trainer_args,
    train_dataset=dset_tokenized["train"],
    eval_dataset=dset_tokenized["test"].select(range(1000)), #make a smaller subset to evaluate on
    compute_metrics=compute_accuracy,
    data_collator=collator,
    callbacks=[early_stopping]
)

# FINALLY!
trainer.train()

max_steps is given, it will override any value given in num_train_epochs


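| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 500  | 0.7081 | 0.669247 | 0.573 |
| 1000 | 0.6213 | 0.594076 | 0.71  |
| 1500 | 0.5384 | 0.519729 | 0.801 |
| 2000 | 0.4662 | 0.462049 | 0.831 |
| 2500 | 0.4097 | 0.419929 | 0.846 |
| 3000 | 0.3672 | 0.389202 | 0.864 |
| 3500 | 0.3336 | 0.367135 | 0.868 |
| 4000 | 0.3083 | 0.349183 | 0.869 |
| 4500 | 0.2858 | 0.336068 | 0.87  |
| 5000 | 0.2672 | 0.324319 | 0.873 |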


TrainOutput(global_step=15500, training_loss=0.26043426070674774, metrics={'train_runtime': 595.6354, 'train_samples_per_second': 4297.931, 'train_steps_per_second': 33.578, 'total_flos': 20270784960.0, 'train_loss': 0.26043426070674774, 'epoch': 79.08163265306122})
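
Training ended at `global_step=15500`, below `max_steps=20000`, which is consistent with early stopping having triggered. The patience rule can be sketched as follows. This is a simplified illustration, not the `transformers` implementation, and the loss curve is hypothetical.

```python
def stopping_point(eval_losses, patience):
    """Return the index of the evaluation at which training would stop, or None."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if loss < best:
            best, bad_evals = loss, 0   # improvement resets the counter
        else:
            bad_evals += 1              # no improvement at this evaluation
            if bad_evals >= patience:
                return i
    return None

# Hypothetical loss curve: improves, then plateaus
toy_losses = [0.67, 0.59, 0.52, 0.46, 0.47, 0.46, 0.48, 0.47, 0.49]
print(stopping_point(toy_losses, patience=5))
```

In this toy curve the best loss is reached at the fourth evaluation, and the five evaluations after it fail to beat it, so training would stop at the ninth.
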

In [59]:
#Evaluation
eval_results = trainer.evaluate(dset_tokenized["test"])

print(eval_results)

{'eval_loss': 0.2803328037261963, 'eval_accuracy': 0.88628, 'eval_runtime': 8.8369, 'eval_samples_per_second': 2829.057, 'eval_steps_per_second': 353.632, 'epoch': 79.08163265306122}
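
The epoch value of about 79.08 reported above follows from the step count and the batch size. A short worked calculation, assuming a single device and a partial final batch in each epoch:

```python
import math

train_examples = 25_000   # size of the IMDb training split
batch_size = 128          # per_device_train_batch_size, assuming a single device
global_step = 15_500      # from the TrainOutput above

steps_per_epoch = math.ceil(train_examples / batch_size)  # last batch is partial
epochs_seen = global_step / steps_per_epoch
print(steps_per_epoch, epochs_seen)
```

So each epoch takes 196 optimizer steps, and the model had seen the training set roughly 79 times when training stopped.
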


### 3.2 Hyperparameter optimization

### 3.3. Evaluation on test set

In [13]:
import optuna

def objective(trial):
    # Define the search space for hyperparameters
    # suggest_float takes (name, low, high); step and log are keyword-only
    learning_rate = trial.suggest_float("learning_rate", 1e-6, 1e-3, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 64, 128, 256])

    trainer_args = transformers.TrainingArguments(
        "mlp_checkpoints", #save checkpoints here
        evaluation_strategy="steps",
        logging_strategy="steps",
        eval_steps=500,
        logging_steps=500,
        learning_rate=learning_rate, #learning rate of the gradient descent
        max_steps=10000, #I will keep this small, no time to wait too long :)
        load_best_model_at_end=True,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size
    )

    mlp = MLP(mlp_config)

    trainer = transformers.Trainer(
        model=mlp,
        args=trainer_args,
        train_dataset=dset_tokenized["train"],
        eval_dataset=dset_tokenized["test"].select(range(1000)), #make a smaller subset to evaluate on
        compute_metrics=compute_accuracy,
        data_collator=collator,
        callbacks=[early_stopping]
    )

    # Train the model, then evaluate it on the 1000-example evaluation subset
    trainer.train()
    eval_results = trainer.evaluate()
    return eval_results["eval_accuracy"] #let's try to maximize accuracy

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=7) #I will keep this small, no time to wait :)
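
What `log=True` means for the learning rate can be illustrated with a small sampling sketch: values are drawn so that each decade between 1e-6 and 1e-3 is roughly equally likely. This is not how Optuna picks values (its default sampler is adaptive); the helper `sample_log_uniform` is a hypothetical name used only to show the scale.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniformly distributed between low and high."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(1e-6, 1e-3, rng) for _ in range(1000)]
# Count samples per decade: [1e-6, 1e-5), [1e-5, 1e-4), [1e-4, 1e-3)
per_decade = [sum(10 ** -(k + 1) <= s < 10 ** -k for s in samples) for k in (5, 4, 3)]
print(per_decade)
```

On a linear scale almost all samples would land in the top decade, so small learning rates would hardly ever be tried.
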

In [None]:
# Print the best trial's hyperparameters and objective value
best_trial = study.best_trial
print(f"Best trial (number {best_trial.number}):")
print(f"  Value: {best_trial.value}")
print(f"  Params: {best_trial.params}")

# Print all trials' hyperparameters and objective values
print("\nAll trials:")
for trial in study.trials:
    print(f"  Trial {trial.number}:")
    print(f"    Value: {trial.value}")
    print(f"    Params: {trial.params}")

In [None]:
import optuna.visualization as vis
fig = vis.plot_parallel_coordinate(study)
fig.show()


In [None]:
fig = vis.plot_slice(study)
fig.show()

In [None]:
fig = vis.plot_contour(study, params=['learning_rate', 'batch_size'])
fig.show()

In [None]:
fig = vis.plot_optimization_history(study)
fig.show()

---

## 4. Results and summary

### 4.1 Corpus insights

(Briefly discuss what you learned about the corpus and its annotation)

### 4.2 Results

(Briefly summarize your results)

### 4.3 Relation to state of the art

(Compare your results to the state-of-the-art performance)

---

## 5. Bonus Task (optional)

### 5.1. Annotating out-of-domain documents

(Briefly describe the chosen out-of-domain documents)

(Briefly describe the process of annotation)

### 5.2 Conversion into dataset

In [None]:
# Your code to convert the annotations into a dataset here

### 5.3. Model evaluation on out-of-domain test set

In [None]:
# Your code to evaluate the model on the out-of-domain test set here

### 5.4 Bonus task results

(Present the results of the evaluation on the out-of-domain test set)

### 5.5. Annotated data

In [None]:
# Include your annotated out-of-domain data here