## Trying to Train the MS Phi-2 Model for Classification

### Using the IMDB Dataset as a test case

We will add a classification head to the Phi-2 model. We will try training without QLoRA first to see whether it fits in memory.

If it does not fit, we will need PEFT, but with the regular `Trainer` rather than `SFTTrainer`, because this is a classification task and not text generation.
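
To get a feel for "does it fit in memory", here is a rough back-of-the-envelope sketch. It assumes Phi-2 has about 2.7 billion parameters and that full fine-tuning with AdamW keeps weights, gradients and two optimizer moment buffers per parameter; activations are ignored, so real usage will be higher. The function names are made up for this illustration.

```python
# Rough memory estimate for full fine-tuning vs. just holding quantized weights.
# Assumption: ~2.7e9 parameters for Phi-2; activations are not counted.
GIB = 1024 ** 3


def training_memory_gib(n_params, bytes_per_param=4):
    """Weights + gradients + two AdamW moment buffers, all at the same precision."""
    weights = n_params * bytes_per_param
    gradients = n_params * bytes_per_param
    adam_states = 2 * n_params * bytes_per_param
    return (weights + gradients + adam_states) / GIB


def weights_only_gib(n_params, bits_per_param):
    """Memory needed only to hold the weights at a given bit width."""
    return n_params * bits_per_param / 8 / GIB


phi2_params = 2.7e9
print(f"full fp32 training : ~{training_memory_gib(phi2_params):.0f} GiB")
print(f"fp16 weights only  : ~{weights_only_gib(phi2_params, 16):.1f} GiB")
print(f"4-bit weights only : ~{weights_only_gib(phi2_params, 4):.1f} GiB")
```

The gap between the first and last number is the motivation for falling back to quantization plus LoRA if plain training does not fit.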

In [21]:
!pip install einops wandb


In [1]:
### imports
import pandas as pd
import torch
import numpy as np
from torch import nn
from transformers import (
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    Trainer,
    PreTrainedModel
)
from transformers.modeling_outputs import SequenceClassifierOutputWithPast
from typing import List, Optional, Tuple, Union
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
from peft import LoraConfig, peft_model, get_peft_model, AutoPeftModelForCausalLM
from peft.tuners.lora import LoraLayer
# from trl import SFTTrainer # this is only needed when we Tune
from datasets import load_dataset, Dataset
from transformers.utils import ( add_code_sample_docstrings,
    add_start_docstrings,
    add_start_docstrings_to_model_forward,
    is_flash_attn_2_available,
    is_flash_attn_greater_or_equal_2_10,
    logging,
    replace_return_docstrings)
import json

In [2]:
dataset = load_dataset("csv", data_files="./data/imdb_reviews/IMDB Dataset.csv")

In [3]:
dataset = dataset["train"].shuffle(42).train_test_split(0.1)
dataset

DatasetDict({
    train: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 45000
    })
    test: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 5000
    })
})

In [4]:
data = dataset["train"].train_test_split(0.2)
data, dataset

(DatasetDict({
     train: Dataset({
         features: ['review', 'sentiment'],
         num_rows: 36000
     })
     test: Dataset({
         features: ['review', 'sentiment'],
         num_rows: 9000
     })
 }),
 DatasetDict({
     train: Dataset({
         features: ['review', 'sentiment'],
         num_rows: 45000
     })
     test: Dataset({
         features: ['review', 'sentiment'],
         num_rows: 5000
     })
 }))

In [5]:
data["eval"] = data["test"]
data["test"] = dataset["test"]

In [6]:
dataset = data
data = None
del data
dataset

DatasetDict({
    train: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 36000
    })
    test: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 5000
    })
    eval: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 9000
    })
})
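
The two-stage split above can be checked with simple arithmetic. This sketch uses integer percentages to avoid float rounding; it is only an illustration of the resulting sizes, not a description of how `train_test_split` rounds internally.

```python
def two_stage_split(n_rows, test_pct, eval_pct):
    """Hold out test_pct of all rows, then eval_pct of what remains."""
    test = n_rows * test_pct // 100
    remaining = n_rows - test
    eval_rows = remaining * eval_pct // 100
    train = remaining - eval_rows
    return {"train": train, "eval": eval_rows, "test": test}


sizes = two_stage_split(50_000, test_pct=10, eval_pct=20)
print(sizes)
```

So the effective split is 72% train, 18% eval and 10% test.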

In [8]:
def count(row):
    return len(row['review'].split(' '))
pd.set_option("expand_frame_repr", False)
pd.set_option('display.max_colwidth', None)
df = pd.DataFrame(dataset["train"].shuffle()[:500])


In [43]:
df['num'] = df.apply(lambda row: count(row), axis=1)

In [44]:
df['num'].mean()

233.498

In [12]:
# class weights to compensate for an unbalanced positive/negative split
sentiment_counts = dataset['train'].to_pandas().sentiment.value_counts()
n_train = len(dataset['train'])
pos_weights = n_train / (2 * sentiment_counts['positive'])
neg_weights = n_train / (2 * sentiment_counts['negative'])

In [21]:
df_small = pd.DataFrame(dataset["test"][444:450])
df_small

Unnamed: 0,review,sentiment
0,"As far as the Muppet line goes, however, this is not the best, nor the second best. This was marketed towards the kiddies, but has some dark, and emotionally upsetting adult moments, to which parents may not wish to expose their children. One of which showcases Miss Piggy going ""postal"" in a jealous rage, which lasts basically throughout the duration of this work.<br /><br />Beyond that, however, the story is progressive, and highly entertaining. One scene in which Joan Rivers and Miss PIggy go berserk in a department store is simply hilarious! And there are other parts of this work which contain the same level of levity and fun.<br /><br />I like this very much, and enjoy it still today.<br /><br />It rates a 7.6/10 from...<br /><br />the Fiend :.",positive
1,"I just can't understand the negative comments about this film. Yes it is a typical boy-meets-girl romance but it is done with such flair and polish that the time just flies by. Henstridge (talk about winning the gene-pool lottery!) is as magnetic and alluring as ever (who says the golden age of cinema is dead?) and Vartan holds his own.<br /><br />There is simmering chemistry between the two leads; the film is most alive when they share a scene - lots! It is done so well that you find yourself willing them to get together...<br /><br />Ignore the negative comments - if you are feeling a bit blue, watch this flick, you will feel so much better. If you are already happy, then you will be euphoric.<br /><br />(PS: I am 33, Male, from the UK and a hopeless romantic still searching for his Princess...)",positive
2,"Take a pinch of GOODFELLAS, mix it with THE GODFATHER, add some Roman mythology and plenty of lowbrow comedy, and you have THE SOPRANOS, about a mob clan operating out of northern New Jersey. It's almost as entertaining as pro wrestling. I am not the biggest fan of this show, but I do admire James Gandolfini's very complicated Tony Soprano, a psychopath with an occasional glimmer of conscience. I also have come to admire te contributions of folks like gravel-voiced Dom Chianese as the bewildered but murderous Uncle Junior, silver-haired Tony Sirico as the perpetually perplexed Paulie and the very beautiful Edie Falco as the duplicitous, tough-as-nails Carmela Soprano. The violence is sudden and graphic, the body count steadily climbs each season, but it is often the small moments that matter most here. Watch Paulie and Tony's nephew Christopher (Michael Imperioli late of LAW & ORDER) as they get lost in the Pine Barrens and sit out a bitter cold night in an abandoned trruck, both convinced they've had it.",positive
3,"what ever you do do not waste your time on this pointless. movie. A remake that did not need to be retold. Everyone coming out of the theater had the same comments. Worst movie I ever saw. Save your time and money!!!<br /><br />Nicgolas Cage was biking down hills, swimming in murky water and rolling down hills while being attacked by bees but yet his suit was still perfectly pressed and shirt crisp white until the very last scene.<br /><br />Although a good cast with Ellen Bernstein and Cage the acting was just as unbelievable as the movie itself. It is amazing how good actors can do such bad movies. Don't they get a copy of the script first. If you still have any interest at all in seeing the movie at the very least wait for it to come out on DVD.",negative
4,"One of the first OVA's (""original video animation"") I ever bought, this still has to be one of my favourite anime titles. A cyberpunk sci-fi action comedy set against an unlikely (for a comedy, that is) background of near-future pollution in a dystopian society.<br /><br />The ""heroes"" of Dominion are the Tank Police, formed with a ""if we can't beat crime, we'll get bigger guns"" philosophy, and who are, like the name suggests, patrolling the city in tanks instead of patrol cars, and who are actually far more dangerous than any criminals they are trying to catch. Most, if not all, of these cops are borderline(?) psychopaths and neurotics, giving new meaning to the phrase ""loose cannons"".<br /><br />Equally colourful and amusing are their adversaries, terrorist Buaku and his hench(wo)men, the Twin Cat Sisters, whose existence always seems to involve giving the Tank Police a hard time.<br /><br />The animation is not state of the art, but it's very nice otherwise; the colourful palette and cartoonish look of the characters and mecha fit nicely with the comedic atmosphere of Dominion.<br /><br />The English dubbing is, again, lots of fun. The soundtrack of the English version is also very good. I wonder if they ever made a soundtrack album of that...<br /><br />Anyway, Dominion Tank Police is great. It's Japanese cyberpunk SF with lots of comedy, filled with completely over-the-top characters and situations, making sure that it never takes itself seriously. Highly recommended.",positive
5,"Franco Rossi's 1985 six-hour Italian mini-series of Quo Vadis is a very curious beast, creating an absolutely convincing ancient Roman world shot in matter of fact fashion (very few long shots, no big cityscapes), but playing the drama down so much in favour of allusions to classical literature and history that the story constantly gets lost in the background.<br /><br />The shifting structure (much of episode one is played out via voice over letters) and lack of narrative urgency makes the full six-hour version simultaneously demanding and undemanding, and certainly far too often uninvolving, but it has something going for it. The two main strengths are the characterisation of Petronius (a thankfully dubbed Frederic Forrest, whose own voice would almost certainly flatten his dialogue) as a man whose spent so long looking for an astute angle to survive court life that he's become incapable of experiencing emotion, and Klaus Maria Brandauer's unique take on Nero as a wannabe actor whose every move and action is calculated on how his 'audience' will receive it. Elsewhere, Max Von Sydow briefly appears in a few episodes, being rewarded with the show's most impressive and genuinely moving scene here he encounters a child as he attempts to leave Rome. It's the kind of thing the show could do with more of, but it seems all too often to flatten every potentially emotional, inspiring or exciting moment under it's relentlessly low-key direction.<br /><br />Unfortunately Francesco Quinn makes a staggeringly anonymous hero, blending in with the walls and coming over less as a Roman officer than that quiet, slightly gormless but inoffensive guy who works in the same office as you who never says much at office parties - you know, the one who you think is called Dave or something like that. 
The budgetary limitations are very visible once its Meet the Lions time for the Christians and Ursus battle with the bull is so determinedly low key that it just passes over you before the show just abruptly loses interest and suddenly ends.<br /><br />Not a trip I can particularly recommend, I'm afraid, but if you do embark on it it's one not entirely without its small rewards.",negative


In [14]:
pos_weights, neg_weights # too evenly spread to bother

(0.9961217659246666, 1.003908550623762)
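
The weights above follow the usual "balanced" formula, `n_samples / (n_classes * count_of_class)`. A minimal sketch with made-up label lists shows why values close to 1.0 mean the weighting can be skipped:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical label lists, not taken from the IMDB data.
balanced = ["positive"] * 50 + ["negative"] * 50
skewed = ["positive"] * 90 + ["negative"] * 10

print(balanced_class_weights(balanced))  # both weights are exactly 1.0
print(balanced_class_weights(skewed))    # the rare class gets a much larger weight
```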

In [12]:
model_name = 'microsoft/phi-2'
model = AutoModelForSequenceClassification.from_pretrained(model_name, torch_dtype="auto", trust_remote_code=True, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, add_special_tokens=True)

ValueError: Unrecognized configuration class <class 'transformers_modules.microsoft.phi-2.d3186761bf5c4409f7679359284066c25ab668ee.configuration_phi.PhiConfig'> for this kind of AutoModel: AutoModelForSequenceClassification.
Model type should be one of AlbertConfig, BartConfig, BertConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BloomConfig, CamembertConfig, CanineConfig, LlamaConfig, ConvBertConfig, CTRLConfig, Data2VecTextConfig, DebertaConfig, DebertaV2Config, DistilBertConfig, ElectraConfig, ErnieConfig, ErnieMConfig, EsmConfig, FalconConfig, FlaubertConfig, FNetConfig, FunnelConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTJConfig, IBertConfig, LayoutLMConfig, LayoutLMv2Config, LayoutLMv3Config, LEDConfig, LiltConfig, LlamaConfig, LongformerConfig, LukeConfig, MarkupLMConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MobileBertConfig, MPNetConfig, MptConfig, MraConfig, MT5Config, MvpConfig, NezhaConfig, NystromformerConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PerceiverConfig, PersimmonConfig, PhiConfig, PLBartConfig, QDQBertConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, SqueezeBertConfig, T5Config, TapasConfig, TransfoXLConfig, UMT5Config, XLMConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig, YosoConfig.

In [68]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

In [46]:
model_name = 'bigscience/bloom-1b1'

In [6]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
# model.config.pad_token_id = tokenizer.pad_token_id

NameError: name 'model_name' is not defined

In [8]:
# let's load it as a base model for CausalLM
model_name = 'microsoft/phi-2'
base_model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [8]:
base_model.config.pad_token_id = tokenizer.pad_token_id
base_model

PhiForCausalLM(
  (transformer): PhiModel(
    (embd): Embedding(
      (wte): Embedding(51200, 2560)
      (drop): Dropout(p=0.0, inplace=False)
    )
    (h): ModuleList(
      (0-31): 32 x ParallelBlock(
        (ln): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.1, inplace=False)
        (mixer): MHA(
          (rotary_emb): RotaryEmbedding()
          (Wqkv): Linear4bit(in_features=2560, out_features=7680, bias=True)
          (out_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (inner_attn): SelfAttention(
            (drop): Dropout(p=0.0, inplace=False)
          )
          (inner_cross_attn): CrossAttention(
            (drop): Dropout(p=0.0, inplace=False)
          )
        )
        (mlp): MLP(
          (fc1): Linear4bit(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear4bit(in_features=10240, out_features=2560, bias=True)
          (act): NewGELUActivation()
        )
      )

### OK, No API for Sequence Classification

Let's override the sequence classification class ourselves.

We will load the model with `AutoModelForCausalLM`, use it as `base_model`, and hardcode the number of labels and outputs, as shown [here](https://colab.research.google.com/drive/1y_CFog1i97Ctwre41kUnKuTGFWgzGWte?usp=sharing#scrollTo=MY3ksrAdyHiG).

The accompanying blog post is [here](https://medium.com/mlearning-ai/microsoft-phi-2-for-classification-b83beaec2069).
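
The core trick in such a head for a decoder-only model is the pooling step: the linear layer scores every token, and only the row at the last non-padding token is kept. Here is a plain-Python sketch of that idea, assuming right padding; the pad id and the numbers are hypothetical.

```python
def last_token_index(input_ids, pad_token_id):
    """Index of the last real token in a right-padded sequence.

    Returns -1 (Python's "last element") when the sequence has no padding.
    """
    if pad_token_id in input_ids:
        return input_ids.index(pad_token_id) - 1
    return -1


def pool_logits(per_token_logits, input_ids, pad_token_id):
    """Keep one logits row per sequence: the one at the last non-pad token."""
    return [
        seq_logits[last_token_index(ids, pad_token_id)]
        for seq_logits, ids in zip(per_token_logits, input_ids)
    ]


PAD = 0  # hypothetical pad token id
batch_ids = [[5, 8, 3, PAD, PAD], [7, 2, 9, 4, 6]]
batch_logits = [
    [[0.1, 0.9], [0.2, 0.8], [0.7, 0.3], [0.0, 0.0], [0.0, 0.0]],
    [[0.5, 0.5], [0.4, 0.6], [0.3, 0.7], [0.2, 0.8], [0.1, 0.9]],
]
pooled = pool_logits(batch_logits, batch_ids, PAD)
print(pooled)
```

Note that this lookup only works for right padding, and that reusing the EOS token as the pad token can confuse it if EOS also appears inside the text.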


In [9]:
# set some variables to use for the classification

NUM_LABELS = 2

class PhiPreTrainedModel(PreTrainedModel):
    config_class = base_model.config_class
    base_model_prefix = "model"
    supports_gradient_checkpointing = True
    _skip_keys_device_placement = "past_key_values"
    _supports_flash_attn_2 = True
    _supports_cache_class = True

    def _init_weights(self, module):
        std = self.config.initializer_range
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
                
#custom class - modified from PhiForSequenceClassification
# original class is here: 
# https://github.com/huggingface/transformers/blob/v4.36.1/src/transformers/models/phi/modeling_phi.py#L1165
class PhiForSequenceClassificationModified(PhiPreTrainedModel):
    def __init__(self, config, base_model, num_labels):
        super().__init__(config)
        self.num_labels = num_labels  # changed
        self.model = base_model.transformer  # changed
        self.score = nn.Linear(base_model.config.hidden_size, num_labels, bias=False)  # changed: uses the constructor argument, not the global

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.model.embd.wte#changed

    def set_input_embeddings(self, value):
        self.model.embd.wte = value#changed

    @add_start_docstrings_to_model_forward("PHI_INPUTS_DOCSTRING")
    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        model_outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            past_key_values=past_key_values,

        )
        hidden_states = model_outputs#changed
        logits = self.score(hidden_states)
        # print(logits)

        if input_ids is not None:
            batch_size = input_ids.shape[0]
        else:
            batch_size = inputs_embeds.shape[0]

        if self.config.pad_token_id is None and batch_size != 1:
            raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
        if self.config.pad_token_id is None:
            sequence_lengths = -1
        else:
            if input_ids is not None:
                sequence_lengths = (torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1).to(
                    logits.device
                )
            else:
                sequence_lengths = -1

        pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
        loss = None
        if labels is not None:
            labels = labels.to(logits.device)
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(pooled_logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(pooled_logits, labels)
        if not return_dict:
            output = (pooled_logits,)  # model_outputs is used as a plain hidden-state tensor above, so there is nothing extra to append
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutputWithPast(
            loss=loss,
            logits=pooled_logits,
            past_key_values=None,
            hidden_states=None,
            attentions=None,
        )#changed




Problems: this doesn't seem to work, with a few errors occurring. Let's try Bloom instead.

In [10]:
# model = PhiForSequenceClassificationModified(base_model.config, base_model, 2)

In [48]:
model, model.config.pad_token_id

(BloomForSequenceClassification(
   (transformer): BloomModel(
     (word_embeddings): Embedding(250880, 1536)
     (word_embeddings_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
     (h): ModuleList(
       (0-23): 24 x BloomBlock(
         (input_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
         (self_attention): BloomAttention(
           (query_key_value): Linear(in_features=1536, out_features=4608, bias=True)
           (dense): Linear(in_features=1536, out_features=1536, bias=True)
           (attention_dropout): Dropout(p=0.0, inplace=False)
         )
         (post_attention_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
         (mlp): BloomMLP(
           (dense_h_to_4h): Linear(in_features=1536, out_features=6144, bias=True)
           (gelu_impl): BloomGelu()
           (dense_4h_to_h): Linear(in_features=6144, out_features=1536, bias=True)
         )
       )
     )
     (ln_f): LayerNorm((1536,), eps=1e-0

### Try Mistral Also

Reading another blog article [here](https://huggingface.co/blog/Lora-for-sequence-classification-with-Roberta-Llama-Mistral#mistral), it seems that smaller models can perform better than large LLMs on classification. So we can run a test case with Mistral, which seems to have a proper sequence classification head on HF.

We may need LoRA and quantization, however, as it seems the model will not fit in memory otherwise.

In [49]:
model.config.id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}
model.config.label2id = {'NEGATIVE': 0, 'POSITIVE': 1}

In [50]:
def process_data(example):
    item = tokenizer(example["review"], truncation=True, max_length=320) # see if this is OK for dyn padding
    item["labels"] = [ 1 if sent == 'positive' else 0 for sent in example["sentiment"]]
    return item

In [51]:
tokenised_data = dataset.map(process_data, batched=True, num_proc=8)

Map (num_proc=8):   0%|          | 0/36000 [00:00<?, ? examples/s]

Map (num_proc=8):   0%|          | 0/5000 [00:00<?, ? examples/s]

Map (num_proc=8):   0%|          | 0/9000 [00:00<?, ? examples/s]

In [52]:
tokenised_data = tokenised_data.remove_columns(["review", "sentiment"])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
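
Because the tokenizer call above truncates but does not pad, padding is left to the collator, which pads each batch only to its own longest sequence. This is a simplified plain-Python sketch of that behaviour, not the collator's actual code; it assumes right padding, whereas the real side depends on the tokenizer's `padding_side`.

```python
def pad_batch(sequences, pad_token_id, max_length=None):
    """Pad every sequence to the longest one in the batch (or to max_length)."""
    target = max_length if max_length is not None else max(len(s) for s in sequences)
    input_ids = [s + [pad_token_id] * (target - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (target - len(s)) for s in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


PAD = 0  # hypothetical pad token id
batch = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]

dynamic = pad_batch(batch, PAD)                # padded to length 4
fixed = pad_batch(batch, PAD, max_length=320)  # padded to the truncation limit

print(len(dynamic["input_ids"][0]), len(fixed["input_ids"][0]))
```

With reviews averaging a few hundred words, most batches end up shorter than the 320-token cap, which saves compute compared with always padding to the maximum.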

In [13]:
#tokenised_data["train"][3]['input_ids']

[44, 47697, 1119, 51651, 1620, 661, 4936, 3638, 7384, 16401,
 1185, 661, 16916, 4384, 15, 14600, 55052, 2194, 791, 1130,
 17, 5361, 1306, 3466, 14998, 4143, 140434, 530, 179501, 2376,
 718, 632, 613, 14216, 1130, 427, 722, 25739, 427, 5067,
 267, 11940, 2194, 7496, 5963, 15, 10118, 26676, 50827, 4978,
 368, 1230, 2233, 461, 5553, 59859, 567, 179034, 5, 47060,
 17, 473, 11602, 1485, 23346, 15, 4618, 14216, 71096, 8610,
 368, 4548, 427, 21525, 6834, 8621, 25754, 15, 530, 44556,
 36874, 165132, 1306, 267, 15422, 86995, 17, 426, 68136, 361,
 35581, 3121, 14275, 1728, 3760, 1881, 15, 368, 18210, 8876,
 1620, 71429, 5276, 217597, 2292, 613, 14216, 361, 24955, 14779,
 17303, 38319, 15, 530, 14216, 361, 9897, 3866, 3808, 36830,
 16622, 1427, 3291, 45240, 6364, 36830, 16045, 17123, 427, 42488,
 3808, 35076, 69, 16489, 17, 21998, 632, 66818, 999, 7963,
 6147, 1119, 2597

In [53]:
training_arguments = TrainingArguments(
    output_dir="./data/finetuned_classifier_bloom_wEval",
    save_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    optim="adamw_torch",
    evaluation_strategy="steps",
    logging_steps=5,
    learning_rate=1e-5,
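    max_grad_norm=0.3,
    eval_steps=0.2,  # a float below 1 is read as a fraction of the total training steps
    num_train_epochs=2,
    warmup_ratio=0.1,  # has no effect with the "constant" scheduler below; "constant_with_warmup" would use it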
    # group_by_length=True,
    fp16=False,
    weight_decay=0.001,
    lr_scheduler_type="constant",
)
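
These arguments determine how many optimizer steps the run takes and how often evaluation happens. Here is a small sketch of that arithmetic, assuming a single device and that a fractional `eval_steps` is rounded up to a whole number of steps; the helper name is made up.

```python
import math


def training_schedule(n_examples, per_device_batch, grad_accum, epochs, eval_ratio, n_devices=1):
    """Return (effective batch size, total optimizer steps, steps between evaluations)."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = n_examples // effective_batch
    total_steps = steps_per_epoch * epochs
    eval_every = math.ceil(total_steps * eval_ratio)
    return effective_batch, total_steps, eval_every


effective_batch, total_steps, eval_every = training_schedule(
    n_examples=36_000, per_device_batch=4, grad_accum=4, epochs=2, eval_ratio=0.2
)
print(effective_batch, total_steps, eval_every)
```

This lines up with the training log in this notebook, which ends at global step 4500 and reports metrics every 900 steps.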

In [54]:
peft_model = get_peft_model(model, LoraConfig(
                            task_type="SEQ_CLS",
                            r=16,
                            lora_alpha=16,
                            target_modules=[
                                'query_key_value',
                                'dense'
                            ],
                            bias="none",
                            lora_dropout=0.05, # Conventional
                        ))
peft_model.print_trainable_parameters()

trainable params: 3,542,016 || all params: 1,068,859,392 || trainable%: 0.3313827830405592
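
The trainable parameter count can be reproduced by hand from the layer shapes printed earlier. The sketch assumes that the `dense` target only matches the attention output projection (not the MLP layers, whose names merely contain "dense") and that the bias-free `score` head stays trainable next to the adapters.

```python
def lora_params(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


HIDDEN, LAYERS, RANK, NUM_LABELS = 1536, 24, 16, 2

per_layer = (
    lora_params(HIDDEN, 3 * HIDDEN, RANK)  # query_key_value: 1536 -> 4608
    + lora_params(HIDDEN, HIDDEN, RANK)    # self_attention.dense: 1536 -> 1536
)
adapter_total = per_layer * LAYERS
score_head = HIDDEN * NUM_LABELS  # classification head without bias

trainable = adapter_total + score_head
print(f"{trainable:,}")
```

The result matches the 3,542,016 reported by `print_trainable_parameters()`, which supports both assumptions.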


In [17]:
peft_model

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): BloomForSequenceClassification(
      (transformer): BloomModel(
        (word_embeddings): Embedding(250880, 1536)
        (word_embeddings_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
        (h): ModuleList(
          (0-23): 24 x BloomBlock(
            (input_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
            (self_attention): BloomAttention(
              (query_key_value): lora.Linear(
                (base_layer): Linear(in_features=1536, out_features=4608, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=1536, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4608, bias=False)
           

In [10]:
class WeightedCELossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        # Get model's predictions
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Compute custom loss
        loss_fct = torch.nn.CrossEntropyLoss(weight=torch.tensor([neg_weights, pos_weights], device=model.device, dtype=logits.dtype))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

In [55]:
import evaluate

# All metrics are predefined in the HF `evaluate` package; load them once rather than on every evaluation
precision_metric = evaluate.load("precision")
recall_metric = evaluate.load("recall")
f1_metric = evaluate.load("f1")
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred  # tuple of predictions and labels collected during evaluation
    predictions = np.argmax(logits, axis=-1)
    precision = precision_metric.compute(predictions=predictions, references=labels)["precision"]
    recall = recall_metric.compute(predictions=predictions, references=labels)["recall"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    # The trainer expects a dictionary that maps metric names to scores.
    return {"precision": precision, "recall": recall, "f1-score": f1, "accuracy": accuracy}
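
For reference, this is what the four metrics boil down to for a binary task. The sketch is a dependency-free illustration with made-up predictions, not a replacement for the `evaluate` metrics.

```python
def binary_metrics(predictions, references, positive=1):
    """Precision, recall, F1 and accuracy for a binary classification task."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    correct = sum(p == r for p, r in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(pairs)
    return {"precision": precision, "recall": recall, "f1-score": f1, "accuracy": accuracy}


# Hypothetical predictions and labels (1 = positive, 0 = negative).
metrics = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(metrics)
```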

In [56]:
trainer = Trainer(
    peft_model,
    training_arguments,
    train_dataset=tokenised_data["train"],
    eval_dataset=tokenised_data["eval"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [37]:
# for name, module in trainer.model.named_modules():
#     if "norm" in name:
#         module = module.to(torch.float32)

In [57]:
trainer.train()

You're using a BloomTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


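| Step | Training Loss | Validation Loss | Precision | Recall | F1-score | Accuracy |
|------|---------------|-----------------|-----------|--------|----------|----------|
| 900 | 0.7843 | 0.617776 | 0.875613 | 0.876394 | 0.876004 | 0.876444 |
| 1800 | 0.3354 | 0.378954 | 0.949876 | 0.854083 | 0.899436 | 0.904889 |
| 2700 | 0.1195 | 0.231434 | 0.928794 | 0.931281 | 0.930036 | 0.930222 |
| 3600 | 0.3597 | 0.216521 | 0.933064 | 0.926818 | 0.929931 | 0.930444 |
| 4500 | 0.2007 | 0.204484 | 0.929732 | 0.944668 | 0.93714 | 0.936889 |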

TrainOutput(global_step=4500, training_loss=0.5262089767704408, metrics={'train_runtime': 3158.4242, 'train_samples_per_second': 22.796, 'train_steps_per_second': 1.425, 'total_flos': 8.953551858696192e+16, 'train_loss': 0.5262089767704408, 'epoch': 2.0})

In [14]:
trainer.evaluate()

{'eval_loss': 0.21896573901176453,
 'eval_runtime': 154.7007,
 'eval_samples_per_second': 80.801,
 'eval_steps_per_second': 20.2,
 'epoch': 3.0}

In [58]:
trainer.save_model('./data/finetuned_classifier_bloom_wEval')

### Checking the Model

In [3]:
!pip install -U bitsandbytes



In [2]:
model_name = './data/finetuned_classifier_bloom_wEval/'
loaded_model = AutoModelForSequenceClassification.from_pretrained(model_name, 
                                                                  trust_remote_code=True, 
                                                                  num_labels=2,
                                                                  device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

Some weights of BloomForSequenceClassification were not initialized from the model checkpoint at bigscience/bloom-1b1 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
loaded_model.push_to_hub('imdb_tuned-bloom1b1-sentiment-classifier')



adapter_model.safetensors:   0%|          | 0.00/14.2M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/snoop088/imdb_tuned-bloom1b1-sentiment-classifier/commit/bedf168fdaf0a3e30c37c91012b0b0792b3f8525', commit_message='Upload BloomForSequenceClassification', commit_description='', oid='bedf168fdaf0a3e30c37c91012b0b0792b3f8525', pr_url=None, pr_revision=None, pr_num=None)

In [9]:
tokenizer.push_to_hub('imdb_tuned-bloom1b1-sentiment-classifier')

tokenizer.json:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/snoop088/imdb_tuned-bloom1b1-sentiment-classifier/commit/91020a3819bd9b053ccaebec0bf88d34bfe56f38', commit_message='Upload tokenizer', commit_description='', oid='91020a3819bd9b053ccaebec0bf88d34bfe56f38', pr_url=None, pr_revision=None, pr_num=None)

In [4]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [5]:
print_trainable_parameters(loaded_model)

trainable params: 0 || all params: 1068856320 || trainable%: 0.0


#### Manually Creating a Simple DataFrame of Reviews

Later I could build an app that takes a date range and a sort method, scrapes reviews from IMDB, and produces movie recommendation lists.

But let's first test the model on a short, manually written list of reviews.

In [6]:
import json

# Load the hand-written reviews with the standard json module
with open('./data/manual_movies.json', 'r') as f:
    data = json.load(f)
# Flatten the "movies" records into a DataFrame and keep a CSV copy
df_manual = pd.json_normalize(data, record_path=['movies'])
df_manual.to_csv('./data/df_manual.csv')

In [None]:
my_set = pd.read_csv("./data/df_manual.csv")

In [23]:

# Tokenize the manual reviews as one padded batch and run the classifier
inputs = tokenizer(list(df_manual["review"]), truncation=True, padding="max_length", max_length=256, return_tensors="pt")
with torch.inference_mode():  # no gradients are needed for prediction
    outputs = loaded_model(**inputs)

In [24]:
# Predicted class per review (1 = positive, 0 = negative)
outputs.logits.argmax(dim=-1)

tensor([1, 1, 1, 0, 0, 0])
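
The argmax gives only hard labels. Below is a minimal sketch of how a row of two logits maps to a label and a softmax confidence, written in plain Python so the arithmetic is visible; with tensors, `torch.softmax(outputs.logits, dim=-1)` computes the same probabilities. The logit values are made up for illustration, the helper names are hypothetical, and the 1 = positive, 0 = negative mapping is the one this notebook uses when it turns predictions into strings.

```python
import math
from typing import List, Tuple

LABELS = {0: "negative", 1: "positive"}  # assumed label order


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits: List[float]) -> Tuple[str, float]:
    """Return the predicted label name and its softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Made-up logits for two reviews: [negative score, positive score]
for row in [[-1.2, 2.3], [0.9, 0.4]]:
    label, confidence = classify(row)
    print(f"{label} ({confidence:.2f})")
```

A confidence close to 0.5 would flag reviews where the classifier is unsure, which the plain 0/1 output hides.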

### Time to load scraped data from IMDB

Let's create a test with real IMDB data: load the scraped data and extract each movie's reviews together with its title.

In [7]:
import json

# Load the scraped IMDB data; each movie record carries its own list of reviews
with open('./data/scraped_01-03_01-11.complete.json', 'r') as f:
    movie_data = json.load(f)
loaded_movies = movie_data["movies"]

In [4]:
df = pd.DataFrame(loaded_movies)
df_exploded = df.explode(["reviews"])
df_exploded.head()

| | title | stars | link | meta | votes | type | reviews |
|---|---|---|---|---|---|---|---|
| 0 | Gladiator II | 6.9 | https://www.imdb.com/title/tt9218128/reviews | 64 | 78K | Movie | {'copy': 'There seems to be a trend these days... |
| 0 | Gladiator II | 6.9 | https://www.imdb.com/title/tt9218128/reviews | 64 | 78K | Movie | {'copy': 'I tried hard not to just compare #2 ... |
| 0 | Gladiator II | 6.9 | https://www.imdb.com/title/tt9218128/reviews | 64 | 78K | Movie | {'copy': 'The film offers a thrilling experien... |
| 0 | Gladiator II | 6.9 | https://www.imdb.com/title/tt9218128/reviews | 64 | 78K | Movie | {'copy': 'Now as i watched the movie i truly t... |
| 0 | Gladiator II | 6.9 | https://www.imdb.com/title/tt9218128/reviews | 64 | 78K | Movie | {'copy': 'Didnt get the same feeling I got bac... |


In [3]:
def process(movie):
    """Split a movie's reviews into parallel lists of review texts and star ratings."""
    reviews = movie["reviews"]
    # List comprehensions also cope with a movie that has no reviews
    return {
        "title": movie["title"],
        "review_copy": [review["copy"] for review in reviews],
        "review_stars": [review["stars"] for review in reviews]
    }

In [4]:
def get_rating(movie, model, tokenizer, device):
    """Summarise a movie as 'positive reviews / reviews' and 'stars given / stars possible'."""
    reviews = [review["copy"] for review in movie["reviews"]]
    stars = [review["stars"] for review in movie["reviews"]]
    total = len(reviews)
    inputs = tokenizer(reviews,
                       truncation=True,
                       padding="max_length",
                       max_length=256,
                       return_tensors="pt").to(device)
    with torch.inference_mode():
        outputs = model(**inputs)
    # Label 1 means positive, so summing the predicted labels counts positive reviews
    total_sentiment = outputs.logits.argmax(dim=-1).sum()
    # Each review is rated out of 10 stars
    return {"total_sentiment": f"{total_sentiment.item()} / {total}",
            "total_stars": f"{np.array(stars).sum()} / {total * 10}"
           }

In [5]:
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

In [8]:
processed = list(map(process, loaded_movies))
# Collect the review texts of all movies into one flat list
processed_reviews = []
for processed_movie in processed:
    processed_reviews.extend(processed_movie["review_copy"])

# All reviews are tokenized and sent through the model as a single batch
inputs = tokenizer(processed_reviews, truncation=True, padding="max_length", max_length=256, return_tensors="pt").to(DEVICE)
with torch.inference_mode():
    outputs = loaded_model(**inputs)
outcome = outputs.logits.argmax(dim=-1).cpu()
sentiment = ['positive' if out == 1 else 'negative' for out in outcome]


OutOfMemoryError: CUDA out of memory. Tried to allocate 2.86 GiB. GPU 0 has a total capacity of 23.54 GiB of which 1.41 GiB is free. Process 1693599 has 21.06 GiB memory in use. Of the allocated memory 20.58 GiB is allocated by PyTorch, and 28.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
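
The out-of-memory error most likely comes from sending every review through the model as one padded batch. One way around it is to run inference in small chunks and keep only the predicted labels. The sketch below is a minimal version of that idea: the helper names `chunked`, `predict_in_batches` and `make_predict_batch`, and the default batch size of 16, are assumptions for illustration and not part of the original notebook. `torch` is imported lazily so the batching helpers can be tried on their own.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def predict_in_batches(texts: Sequence[str],
                       predict_batch: Callable[[List[str]], List[int]],
                       batch_size: int = 16) -> List[int]:
    """Apply `predict_batch` chunk by chunk and concatenate the predicted labels."""
    labels: List[int] = []
    for batch in chunked(texts, batch_size):
        labels.extend(predict_batch(batch))
    return labels


def make_predict_batch(model, tokenizer, device, max_length: int = 256):
    """Wrap a sequence-classification model as a `predict_batch` function (hypothetical helper)."""
    import torch  # lazy import: only needed once a real model is used

    def predict_batch(batch: List[str]) -> List[int]:
        inputs = tokenizer(batch, truncation=True, padding="max_length",
                           max_length=max_length, return_tensors="pt").to(device)
        with torch.inference_mode():
            logits = model(**inputs).logits
        # Only the small tensor of labels is moved back to the CPU
        return logits.argmax(dim=-1).cpu().tolist()

    return predict_batch
```

With the objects from this notebook, the call would look like `predict_in_batches(processed_reviews, make_predict_batch(loaded_model, tokenizer, DEVICE))`. A smaller `batch_size` trades speed for lower peak GPU memory.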

In [12]:
get_rating(loaded_movies[2], loaded_model, tokenizer, DEVICE)

{'total_sentiment': '3 / 15', 'total_stars': '60 / 150'}

In [8]:
movies = [{"title": movie["title"],
           "stars": movie["stars"], 
           "meta": movie["meta"],
           "votes": movie["votes"],
           "type": movie["type"],
           "link": movie["link"].split("review")[0],
           **get_rating(movie, loaded_model, tokenizer, DEVICE)} for movie in loaded_movies]
movies_df = pd.DataFrame(movies)

In [9]:
movies_df

| | title | stars | meta | votes | type | link | total_sentiment | total_stars |
|---|---|---|---|---|---|---|---|---|
| 0 | Gladiator II | 6.9 | 64.0 | 78K | Movie | https://www.imdb.com/title/tt9218128/ | 3 / 10 | 58 / 100 |
| 1 | Dune: Prophecy | 7.4 | | 9.6K | TV Series | https://www.imdb.com/title/tt10466872/ | 7 / 10 | 65 / 100 |
| 2 | The Penguin | 8.7 | | 123K | TV Mini Series | https://www.imdb.com/title/tt15435876/ | 10 / 10 | 95 / 100 |
| 3 | Deadpool & Wolverine | 7.7 | 56.0 | 400K | Movie | https://www.imdb.com/title/tt6263850/ | 7 / 10 | 79 / 100 |
| 4 | The Substance | 7.4 | 78.0 | 148K | Movie | https://www.imdb.com/title/tt17526714/ | 4 / 10 | 71 / 100 |
| 5 | Twisters | 6.5 | 65.0 | 130K | Movie | https://www.imdb.com/title/tt12584954/ | 5 / 10 | 57 / 100 |
| 6 | Heretic | 7.2 | 71.0 | 19K | Movie | https://www.imdb.com/title/tt28015403/ | 10 / 10 | 76 / 100 |
| 7 | Smile 2 | 6.9 | 66.0 | 52K | Movie | https://www.imdb.com/title/tt29268110/ | 6 / 10 | 66 / 100 |
| 8 | Anora | 8.2 | 91.0 | 23K | Movie | https://www.imdb.com/title/tt28607951/ | 9 / 10 | 85 / 100 |
| 9 | Disclaimer | 7.5 | | 18K | TV Mini Series | https://www.imdb.com/title/tt16294384/ | 7 / 10 | 68 / 100 |
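
Because `total_sentiment` and `total_stars` are stored as strings such as `7 / 10`, they cannot be sorted or compared directly. The sketch below parses them back into fractions and ranks titles by their share of positive reviews. The helper names `parse_ratio` and `rank_by_sentiment` are hypothetical, and the three sample rows are copied from the table above.

```python
from typing import Dict, List, Tuple


def parse_ratio(text: str) -> float:
    """Turn a string like '7 / 10' into the fraction 0.7."""
    numerator, denominator = (float(part) for part in text.split("/"))
    if denominator == 0:
        raise ValueError(f"zero denominator in {text!r}")
    return numerator / denominator


def rank_by_sentiment(movies: List[Dict[str, str]]) -> List[Tuple[str, float, float]]:
    """Return (title, positive share, star share) tuples, most positive first."""
    scored = [
        (m["title"], parse_ratio(m["total_sentiment"]), parse_ratio(m["total_stars"]))
        for m in movies
    ]
    # Sort by positive share, breaking ties with the star share
    return sorted(scored, key=lambda row: (row[1], row[2]), reverse=True)


sample = [
    {"title": "Gladiator II", "total_sentiment": "3 / 10", "total_stars": "58 / 100"},
    {"title": "The Penguin", "total_sentiment": "10 / 10", "total_stars": "95 / 100"},
    {"title": "Heretic", "total_sentiment": "10 / 10", "total_stars": "76 / 100"},
]
for title, positive, stars in rank_by_sentiment(sample):
    print(f"{title}: {positive:.0%} positive, {stars:.0%} of possible stars")
```

Returning the two counts as numbers from `get_rating` instead of formatted strings would avoid the parsing step altogether.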


In [10]:
movies_df.to_csv('./data/movies_sentiment_01.03-01.11.csv')