### Fine-Tuning Sentiment Analysis

### Importing Libraries

Importing the required libraries: NumPy, `load_dataset` from Datasets, `pprint`, `summary` from torchinfo, and the Hugging Face Transformers classes used for sequence classification (`AutoTokenizer`, `AutoModelForSequenceClassification`, `TrainingArguments`, and `Trainer`).

In [1]:
import numpy as np

from datasets import load_dataset
from pprint import pprint
from torchinfo import summary
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

Loading the Rotten Tomatoes movie-review dataset from the Hugging Face Hub.

In [2]:
dataframe = load_dataset('rotten_tomatoes')

In [3]:
dataframe

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [4]:
dataframe['train']

Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})

In [5]:
dir(dataframe['train'])

['_TF_DATASET_REFS',
 '__class__',
 '__del__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getitems__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_build_local_temp_path',
 '_check_index_is_initialized',
 '_data',
 '_estimate_nbytes',
 '_fingerprint',
 '_format_columns',
 '_format_kwargs',
 '_format_type',
 '_generate_tables_from_cache_file',
 '_generate_tables_from_shards',
 '_get_cache_file_path',
 '_get_output_signature',
 '_getitem',
 '_indexes',
 '_indices',
 '_info',
 '_map_single',
 '_new_dataset_with_indices',
 '_output_all_columns',
 '_push_parquet_shards_to_hub',
 '_save_to_disk_single',
 '_select_contiguous',
 ...

In [6]:
dataframe.data

{'train': MemoryMappedTable
 text: string
 label: int64
 ----
 text: [["the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .","the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .","effective but too-tepid biopic","if you sometimes like to go to the movies to have fun , wasabi is a good place to start .","emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .",...,"an uplifting , near-masterpiece .","superior genre storytelling , which gets under our skin simply by crossing the nuclear line .","by taking entertainment tonight subject matter and giving it humor and poignancy , auto focus becomes both gut-bustingly funny and crushingly de

In [7]:
dataframe['train'][0]

{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'label': 1}

In [8]:
dataframe['train'].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['neg', 'pos'], id=None)}

Loading the tokenizer that matches the `distilbert-base-uncased` checkpoint with `AutoTokenizer.from_pretrained(checkpoint)`.

In [9]:
checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Tokenizing the first three reviews in the `text` column of the `train` split and pretty-printing the result. Note that `dataframe` is a `DatasetDict`, not a pandas DataFrame, despite its name.

In [10]:
sentence_tokenized = tokenizer(dataframe['train'][0:3]['text'])

pprint(sentence_tokenized)

{'attention_mask': [[1,
                     1,
                     1,
                     ...

Defining a `tokenize_function` that tokenizes a batch of texts with truncation enabled, so inputs longer than the model's maximum length are cut off. No padding is applied at this stage. Applying the function to every split of `dataframe` with `map` and `batched=True`, and storing the result in `tokenized_dataframe`.

In [11]:
def tokenize_function(batch):
    return tokenizer(batch['text'], truncation=True)

tokenized_dataframe = dataframe.map(tokenize_function, batched=True)

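Because `tokenize_function` truncates but does not pad, the tokenized reviews have different lengths, and padding is deferred until a batch is formed. The following is a minimal pure-Python sketch of that idea, not the library's implementation: the helper `pad_batch` and the toy token ids are hypothetical, and `pad_id=0` is an assumption chosen for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# toy token ids (hypothetical values, not real tokenizer output)
batch = pad_batch([[101, 7592, 102], [101, 2023, 2003, 2307, 102]])
print(batch["attention_mask"])
```

Padding per batch rather than to a global maximum keeps short batches small, which is why the attention masks produced by the tokenizer alone contain only ones.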

Creating a `TrainingArguments` object named `training_args` that writes checkpoints and logs to `trainer_log`, evaluates and saves at the end of every epoch, and trains for a single epoch (`num_train_epochs=1`).

In [12]:
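training_args = TrainingArguments(
    'trainer_log',                # output directory for checkpoints and logs
    evaluation_strategy='epoch',  # evaluate after each epoch (renamed `eval_strategy` in newer transformers releases)
    save_strategy='epoch',        # save a checkpoint after each epoch
    num_train_epochs=1,
)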
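As a rough sanity check on what one epoch means here, the number of optimizer steps can be estimated from the training-set size. This sketch assumes a single device, no gradient accumulation, and the `TrainingArguments` default `per_device_train_batch_size` of 8, none of which is set explicitly in the cell above. The helper name `steps_per_epoch` is hypothetical.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    # the last, smaller batch still counts as one step
    return math.ceil(num_examples / batch_size)


train_rows = 8530  # size of the train split shown earlier
num_epochs = 1
total_steps = steps_per_epoch(train_rows, batch_size=8) * num_epochs
print(total_steps)  # 1067
```

This agrees with the `global_step=1067` that `trainer.train()` reports later in the notebook.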

In [13]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
summary(model)

Layer (type:depth-idx)                                  Param #
DistilBertForSequenceClassification                     --
├─DistilBertModel: 1-1                                  --
│    └─Embeddings: 2-1                                  --
│    │    └─Embedding: 3-1                              23,440,896
│    │    └─Embedding: 3-2                              393,216
│    │    └─LayerNorm: 3-3                              1,536
│    │    └─Dropout: 3-4                                --
│    └─Transformer: 2-2                                 --
│    │    └─ModuleList: 3-5                             42,527,232
├─Linear: 1-2                                           590,592
├─Linear: 1-3                                           1,538
├─Dropout: 1-4                                          --
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
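The counts in the summary can be reproduced by hand. The sketch below assumes the published `distilbert-base-uncased` configuration (vocabulary size 30,522, hidden size 768, 512 positions, 6 layers, feed-forward size 3,072). These values are not printed in the notebook, so treat them as assumptions; the helper `linear` is hypothetical.

```python
def linear(n_in, n_out):
    # weight matrix plus bias vector
    return n_in * n_out + n_out


vocab, hidden, positions, layers, ffn, labels = 30522, 768, 512, 6, 3072, 2

word_emb = vocab * hidden        # Embedding: 3-1
pos_emb = positions * hidden     # Embedding: 3-2
layer_norm = 2 * hidden          # LayerNorm: 3-3 (scale and shift)

per_layer = (
    4 * linear(hidden, hidden)   # query, key, value, and output projections
    + linear(hidden, ffn)        # feed-forward expansion
    + linear(ffn, hidden)        # feed-forward projection
    + 2 * layer_norm             # two LayerNorms per block
)
transformer = layers * per_layer         # ModuleList: 3-5
pre_classifier = linear(hidden, hidden)  # Linear: 1-2
classifier = linear(hidden, labels)      # Linear: 1-3

total = word_emb + pos_emb + layer_norm + transformer + pre_classifier + classifier
print(f"{total:,}")
```

Only the two `Linear` heads (about 0.6M parameters) are newly initialized; the rest comes from the pretrained checkpoint, which matches the warning printed when the model was loaded.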

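Iterating through the named parameters of the PyTorch model and storing a snapshot of each one as a NumPy array in the list `params_before`. The `.copy()` matters: `p.detach().numpy()` on its own returns an array that shares memory with the parameter, so when training on CPU the "before" values would silently change along with the weights.

In [15]:
params_before = []
for name, p in model.named_parameters():
    # copy so the snapshot does not alias the live parameter tensor
    params_before.append(p.detach().cpu().numpy().copy())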
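A snapshot like `params_before` is typically used after training to check which weights actually moved. The notebook does not show that step, so the following is only a sketch with small stand-in arrays: `count_changed` is a hypothetical helper and the values are made up. It also shows why an aliased "snapshot" would report no change at all.

```python
import numpy as np


def count_changed(before, after):
    """Return how many arrays differ between two equally ordered lists."""
    return sum(not np.array_equal(b, a) for b, a in zip(before, after))


# stand-ins for two parameter tensors
live = [np.zeros(3), np.ones(2)]
snapshot = [p.copy() for p in live]  # independent copies
alias = [p for p in live]            # same memory as the live arrays

live[0] += 0.5                       # simulate an in-place optimizer update

print(count_changed(snapshot, live))  # 1
print(count_changed(alias, live))     # 0
```

With a real model, the `after` list could be rebuilt the same way as `params_before` once training has finished.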

Defining `metric_function`, a thin wrapper around scikit-learn's `accuracy_score`, and `compute_metrics`, which converts the logits to predicted class ids with `argmax` and returns the accuracy in a dictionary. Configuring a `Trainer` with the model, the training arguments, the tokenized training and validation splits, the tokenizer, and `compute_metrics`, then starting training with `trainer.train()`.

In [17]:
from sklearn.metrics import accuracy_score

def metric_function(predictions, references):
    # accuracy_score expects (y_true, y_pred)
    return accuracy_score(references, predictions)

def compute_metrics(logits_and_labels):
    # the Trainer passes an object that unpacks into (logits, labels)
    logits, labels = logits_and_labels
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": metric_function(predictions, labels)}


trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataframe['train'],
    eval_dataset=tokenized_dataframe['validation'],
    tokenizer=tokenizer,  # also lets the Trainer pad each batch dynamically
    compute_metrics=compute_metrics,
)

trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.2504        | 0.53002         | 0.846154 |


TrainOutput(global_step=1067, training_loss=0.2381622936531217, metrics={'train_runtime': 4656.3906, 'train_samples_per_second': 1.832, 'train_steps_per_second': 0.229, 'total_flos': 97956536601456.0, 'train_loss': 0.2381622936531217, 'epoch': 1.0})
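To make the accuracy figure concrete, here is what the `argmax`-then-compare step does on a toy batch. The logits and labels are invented, and plain NumPy stands in for `accuracy_score`, so this is an illustration under those assumptions, not the notebook's exact code path.

```python
import numpy as np

# invented logits for four reviews, columns are (neg, pos)
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.9]])
labels = np.array([0, 1, 0, 0])

predictions = np.argmax(logits, axis=-1)          # index of the larger logit per row
accuracy = float(np.mean(predictions == labels))  # fraction of correct predictions

print(predictions.tolist(), accuracy)
```

Three of the four toy predictions match their labels, so the sketch reports 0.75. The notebook's 0.846154 is the same ratio computed over the 1,066 validation reviews.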

In [18]:
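# save_model writes a directory (config, weights, tokenizer files); despite the .h5 suffix, 'model.h5' is a folder, not a single HDF5 file
trainer.save_model('model.h5')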

In [20]:
from transformers import pipeline

classifier = pipeline('text-classification', model='model.h5')

In [21]:
classifier('That movie was fucking awesome.')

[{'label': 'LABEL_1', 'score': 0.9894116520881653}]
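The pipeline reports the generic name `LABEL_1` because no label names were attached to the model configuration. Given the dataset's `ClassLabel(names=['neg', 'pos'])`, index 1 corresponds to `pos`, and the score is the softmax probability of the winning class. A minimal sketch of that post-processing follows, with made-up logits and a hypothetical `postprocess` helper rather than the pipeline's own code.

```python
import math


def postprocess(logits, names):
    # numerically stable softmax over the class logits
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": names[best], "score": probs[best]}


result = postprocess([-2.0, 2.5], names=["neg", "pos"])  # invented logits
print(result)
```

To get readable names from the real pipeline, one option is to pass `id2label={0: 'neg', 1: 'pos'}` and the matching `label2id` to `from_pretrained` before training, so the mapping is stored in the saved config.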