# File Loads

In [49]:
# Other GPU memory troubleshooting commands
#import torch
#torch.cuda.memory_stats() 
#torch.cuda.empty_cache()
#torch.cuda.memory_summary(device=None, abbreviated=False)

In [50]:
#Needed to avoid running out of GPU memory
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:256"

In [51]:
!nvidia-smi

Sun Oct 22 14:53:31 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 40%   29C    P8    11W / 170W |   4049MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                   

In [52]:
import torch
torch.cuda.is_available()

True

In [105]:
from datasets import load_dataset

imdb_ds_dict = load_dataset("csv", data_files="movie.csv")
imdb_ds = imdb_ds_dict['train']
imdb_ds = imdb_ds.train_test_split(test_size=0.3)
imdb_ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 28000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 12000
    })
})

In [77]:
imdb_train_ds = imdb_ds["train"]
imdb_train_ds

Dataset({
    features: ['text', 'label'],
    num_rows: 28000
})

In [78]:
imdb_test_ds = imdb_ds["test"]
imdb_test_ds

Dataset({
    features: ['text', 'label'],
    num_rows: 12000
})

# Configuration

Each model configuration has different attributes; for instance, all NLP models have the hidden_size, num_attention_heads, num_hidden_layers and vocab_size attributes in common. These attributes specify the number of attention heads or hidden layers to construct a model with.

This is the default BERT configuration:

In [79]:
from transformers import BertConfig

In [80]:
config = BertConfig()
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.31.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
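As a small worked example of how these attributes relate, the sketch below derives the per-head dimension from the default values printed above. It assumes standard multi-head attention, where the hidden vector is split evenly across the heads; the variable names head_dim and ffn_ratio are illustrative only and are not configuration fields.

```python
# Values copied from the default BertConfig printed above.
hidden_size = 768
num_attention_heads = 12
intermediate_size = 3072

# Each attention head works on an equal slice of the hidden vector,
# so hidden_size must be divisible by the number of heads.
assert hidden_size % num_attention_heads == 0
head_dim = hidden_size // num_attention_heads

# Ratio between the feed-forward layer width and the hidden size.
ffn_ratio = intermediate_size // hidden_size

print(head_dim, ffn_ratio)
```

If you change hidden_size or num_attention_heads in a custom configuration, keeping the first divisible by the second avoids an invalid attention layout.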



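The code below is meant to change the activation function and the attention dropout rate. Note that activation and attention_dropout are DistilBertConfig argument names. BertConfig stores them as extra attributes, so in the output hidden_act is still "gelu" and attention_probs_dropout_prob is still 0.1. The corresponding BertConfig arguments are hidden_act and attention_probs_dropout_prob.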

In [81]:
my_config = BertConfig(activation="relu", attention_dropout=0.4)
print(my_config)

BertConfig {
  "activation": "relu",
  "attention_dropout": 0.4,
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.31.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
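A minimal sketch of the same change written with BertConfig's own field names (hidden_act and attention_probs_dropout_prob, both visible in the printed configuration above). The name fixed_config is hypothetical, and this is not what was run in this notebook, so the results later on come from the configuration printed above.

```python
from transformers import BertConfig

# hidden_act and attention_probs_dropout_prob are the fields the BERT model reads;
# fixed_config is an illustrative name, not part of the original notebook.
fixed_config = BertConfig(hidden_act="relu", attention_probs_dropout_prob=0.4)

print(fixed_config.hidden_act, fixed_config.attention_probs_dropout_prob)
```

Printing the whole object, as done above, is a quick way to confirm that an argument landed in the field you intended rather than being stored as an extra key.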



Once the configuration is defined, it can be saved:

In [82]:
my_config.save_pretrained(save_directory="./Bert-HuggingFace")
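A saved configuration can be loaded back with from_pretrained. The sketch below is a hedged round trip that writes to a temporary scratch directory (a hypothetical location chosen for illustration, not the ./Bert-HuggingFace directory used above) and uses a changed num_hidden_layers value only so the reload is easy to verify.

```python
import os
import tempfile

from transformers import BertConfig

# tmp_dir is a throwaway directory used only for this illustration.
with tempfile.TemporaryDirectory() as tmp_dir:
    BertConfig(num_hidden_layers=6).save_pretrained(tmp_dir)
    saved_files = os.listdir(tmp_dir)          # files written by save_pretrained
    reloaded = BertConfig.from_pretrained(tmp_dir)

print(saved_files, reloaded.num_hidden_layers)
```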

# Model Creation

The model defines what each layer is doing and what operations are happening. Attributes like num_hidden_layers from the configuration are used to define the architecture. Every model shares the base class PreTrainedModel and a few common methods like resizing input embeddings and pruning self-attention heads. 

Load your custom configuration attributes into the model. For sentiment analysis, the sequence classification variant is needed:

In [83]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification(my_config)

This creates a model with random values instead of pretrained weights, so it is not useful for anything until you train it. Training is costly and time-consuming. It is generally better to start from a pretrained model to obtain better results faster, while using only a fraction of the resources required for training from scratch.

The base BERT model outputs the hidden states, which are passed as inputs to a model head to produce the final output. BertForSequenceClassification adds a sequence classification head on top of the base model. 🤗 Transformers provides a different model head for each task as long as a model supports the task (for example, you can’t use DistilBERT for a sequence-to-sequence task like translation).

The last base class you need before using a model for textual data is a tokenizer to convert raw text to tensors. There are two types of tokenizers you can use with 🤗 Transformers:

- PreTrainedTokenizer: a Python implementation of a tokenizer.
- PreTrainedTokenizerFast: a tokenizer from the Rust-based 🤗 Tokenizers library.
    
If you trained your own tokenizer, you can create one from your vocabulary file.

It is important to remember that the vocabulary from a custom tokenizer will be different from the vocabulary generated by a pretrained model’s tokenizer. You need to use a pretrained model’s vocabulary if you are using a pretrained model, otherwise the inputs won’t make sense. Create a tokenizer with a pretrained model’s vocabulary with AutoTokenizer, which resolves to BertTokenizerFast for bert-base-uncased.

Instantiate the tokenizer first; the preprocessing function that follows tokenizes the text and truncates longer sequences. Sequences should be no longer than the model’s maximum input length (512 positions in the default configuration); this notebook uses a shorter limit of 50 tokens:

In [84]:
from transformers import AutoTokenizer
my_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Using the vocab file from the wikipedia data set resulted in bad results.  Used the matching Bert AutoTokenizer to get a working model


In [85]:
def preprocess_function(examples):
    # Truncate to at most 50 tokens; padding=True pads to the longest sequence in each map batch.
    return my_tokenizer(examples["text"], padding=True, truncation=True, max_length=50)

In [86]:
# Use the 🤗 Datasets map function to apply the preprocessing function to the entire dataset. Setting batched=True
# applies the preprocessing function to multiple elements of the dataset at once for faster preprocessing.

tokenized_imdb_train = imdb_ds["train"].map(preprocess_function, batched=True)
tokenized_imdb_test = imdb_ds["test"].map(preprocess_function, batched=True)


In [87]:
tokenized_imdb_train

Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 28000
})

In [88]:
tokenized_imdb_test

Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 12000
})

Lastly, pad the tokenized sequences so they have a uniform length. While it is possible to pad in the tokenizer call by setting padding=True (as preprocess_function above does), it is more efficient to pad only to the length of the longest element in each batch. This is known as dynamic padding. You can do this with DataCollatorWithPadding, which is created in the next section.

# Fine-tune with the PyTorch Trainer

Start by batching the processed examples together with dynamic padding using DataCollatorWithPadding. Its default return_tensors="pt" yields PyTorch tensors, which is what the Trainer used below expects (return_tensors="tf" is only needed for a TensorFlow workflow).

In [89]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(my_tokenizer)
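To illustrate what dynamic padding saves, here is a library-free sketch that compares padding every sequence to one global length with padding each batch to its own longest sequence. The helper pad_batch and the token-id sequences are hypothetical; pad_id=0 mirrors the pad_token_id shown in the configuration. It is not how DataCollatorWithPadding is implemented, only the idea behind it.

```python
def pad_batch(batch, pad_id=0, length=None):
    """Pad every sequence in `batch` to `length` (default: longest in the batch)."""
    target = length if length is not None else max(len(seq) for seq in batch)
    return [seq + [pad_id] * (target - len(seq)) for seq in batch]

# Hypothetical token-id sequences of different lengths, already split into two batches.
batches = [
    [[101, 7, 102], [101, 7, 8, 102]],
    [[101, 102], [101, 5, 6, 7, 8, 9, 102]],
]

# Fixed padding: every sequence is padded to the longest sequence in the whole dataset.
global_max = max(len(seq) for batch in batches for seq in batch)
fixed = sum(len(row) for b in batches for row in pad_batch(b, length=global_max))

# Dynamic padding: each batch is padded only to its own longest sequence.
dynamic = sum(len(row) for b in batches for row in pad_batch(b))

print(fixed, dynamic)
```

The second total is smaller because the first batch no longer carries padding that only the second batch needs.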

The to_tf_dataset conversion to the tf.data.Dataset format belongs to the TensorFlow workflow and is not needed here, because the Trainer consumes the tokenized datasets directly.

Define the metrics you will use to evaluate how good the fine-tuned model is:

In [90]:
import numpy as np
from datasets import load_metric  # deprecated in newer 🤗 Datasets releases in favor of the evaluate library

# Load the metrics once instead of on every evaluation call.
accuracy_metric = load_metric("accuracy")
f1_metric = load_metric("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit.
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}
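For intuition about what compute_metrics returns, the following NumPy-only sketch works through accuracy and F1 by hand. The logits and labels are made-up values, and the F1 here assumes binary averaging with class 1 as the positive class.

```python
import numpy as np

# Hypothetical logits for four reviews (columns: class 0, class 1) and their true labels.
logits = np.array([[2.0, 0.5], [0.1, 1.2], [0.3, 0.9], [1.5, 0.2]])
labels = np.array([0, 1, 0, 0])

# The predicted class is the index of the largest logit in each row.
predictions = np.argmax(logits, axis=-1)

# Accuracy: fraction of predictions that match the labels.
accuracy = float(np.mean(predictions == labels))

# Binary F1 with class 1 treated as the positive class.
tp = int(np.sum((predictions == 1) & (labels == 1)))
fp = int(np.sum((predictions == 1) & (labels == 0)))
fn = int(np.sum((predictions == 0) & (labels == 1)))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(predictions.tolist(), accuracy, round(f1, 3))
```

One false positive out of four examples gives an accuracy of 0.75 but a noticeably lower F1, which is why reporting both is useful.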

In [91]:
from transformers import TrainingArguments, Trainer

In [92]:
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
)


In [93]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb_train,
    eval_dataset=tokenized_imdb_test,
    tokenizer=my_tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)


In [94]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 500 | 0.707 |
| 1000 | 0.7077 |
| 1500 | 0.6811 |
| 2000 | 0.5984 |
| 2500 | 0.5262 |
| 3000 | 0.5209 |
| 3500 | 0.5087 |
| 4000 | 0.4527 |
| 4500 | 0.441 |
| 5000 | 0.4479 |


TrainOutput(global_step=8750, training_loss=0.4820475533621652, metrics={'train_runtime': 969.2456, 'train_samples_per_second': 144.442, 'train_steps_per_second': 9.028, 'total_flos': 3597221460000000.0, 'train_loss': 0.4820475533621652, 'epoch': 5.0})
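The step count and throughput in TrainOutput can be checked against the dataset size and the training arguments. This is a small arithmetic sketch that assumes a single device and no gradient accumulation; all numbers are copied from the outputs above.

```python
import math

# Values taken from the dataset summary, TrainingArguments and TrainOutput above.
num_train_rows = 28000
per_device_train_batch_size = 16
num_train_epochs = 5
train_runtime_s = 969.2456

# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(num_train_rows / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Throughput figures derived from the reported runtime.
samples_per_second = num_train_rows * num_train_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print(steps_per_epoch, total_steps, round(samples_per_second, 3), round(steps_per_second, 3))
```

The total reproduces global_step=8750, and the derived rates agree with the reported train_samples_per_second and train_steps_per_second.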

In [95]:
trainer.evaluate()

{'eval_loss': 0.5219933390617371,
 'eval_accuracy': 0.7640833333333333,
 'eval_f1': 0.7613990729034977,
 'eval_runtime': 21.3179,
 'eval_samples_per_second': 562.907,
 'eval_steps_per_second': 35.182,
 'epoch': 5.0}

In [96]:
trainer.save_model("./Bert-HuggingFace/finetune-imdb")

In [102]:
from transformers import pipeline

sentiment_classification = pipeline('text-classification', model="/home/littlepenguin/Git/MS-BAIS/2023-08-IndStudy-Transformers/HuggingFace/Bert-HuggingFace/finetune-imdb")

sentiment_classification(["I don't know", 
                          "I don't care",
                          "I hate horror movies"])

# LABEL_0 = negative
# LABEL_1 = positive

#  These were also labeled as negative in the unsupervised model

[{'label': 'LABEL_0', 'score': 0.8385729789733887},
 {'label': 'LABEL_0', 'score': 0.8936507105827332},
 {'label': 'LABEL_0', 'score': 0.9752849340438843}]
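Since the pipeline reports generic LABEL_0 and LABEL_1 names, a small post-processing step can make the output easier to read. The sketch below applies the mapping stated in the comments above to the results of the first call; label_names and readable are illustrative names, not part of the pipeline API.

```python
# Output copied from the first pipeline call above.
raw_results = [
    {"label": "LABEL_0", "score": 0.8385729789733887},
    {"label": "LABEL_0", "score": 0.8936507105827332},
    {"label": "LABEL_0", "score": 0.9752849340438843},
]

# Mapping stated in the notebook comments: LABEL_0 = negative, LABEL_1 = positive.
label_names = {"LABEL_0": "negative", "LABEL_1": "positive"}

# Replace the generic label with a readable name and shorten the score.
readable = [
    {"label": label_names[r["label"]], "score": round(r["score"], 3)}
    for r in raw_results
]

print(readable)
```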

In [104]:
sentiment_classification(["So great; I loved it!",
                            "This is average!",
                            "This is average",
                            "This is a nightmare!",
                            "This is a dream!",
                            "I love my dog"])

#  These were also labeled as positive in the unsupervised model.  The scores are different.  

[{'label': 'LABEL_1', 'score': 0.978111743927002},
 {'label': 'LABEL_1', 'score': 0.9714835286140442},
 {'label': 'LABEL_1', 'score': 0.9659692049026489},
 {'label': 'LABEL_1', 'score': 0.9689449667930603},
 {'label': 'LABEL_1', 'score': 0.9133505821228027},
 {'label': 'LABEL_1', 'score': 0.9779731631278992}]