<a href="https://colab.research.google.com/github/susantaghosh1/nlp-notebooks/blob/develop/Language_Modelling%5BMLM%2BCLM%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [7]:
%%capture
!pip install datasets transformers[sentencepiece]
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
!pip install scipy scikit-learn
!pip install accelerate

In [8]:
%%bash
nvidia-smi
python --version
which nvcc
echo $PATH
echo $LD_LIBRARY_PATH


Thu Jul 28 14:56:38 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   60C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [9]:
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

device(type='cuda')

In [10]:
import transformers,torch,accelerate
transformers.__version__,torch.__version__,accelerate.__version__

('4.21.0', '1.12.0+cu113', '0.11.0')

In [11]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification,AutoModelForMaskedLM
import torch
model_name = 'distilbert-base-uncased'
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier

In [12]:
tokenizer

PreTrainedTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

## Preprocessing the data

In [13]:
from datasets import load_dataset
imdb_dataset = load_dataset("imdb")
imdb_dataset

Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [14]:
imdb_dataset['train'][0]

{'label': 0,
 'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are f

Tokenize the whole corpus and add a word_ids key that maps each token to the index of the word it came from.

In [15]:
def tokenize_text(batch_of_data):
  # Tokenize the whole batch without truncation; the text is chunked later.
  batch_encoding = tokenizer(batch_of_data['text'])
  # word_ids(idx) is only available with a fast tokenizer.
  batch_encoding['word_ids'] = [
      batch_encoding.word_ids(idx)
      for idx in range(len(batch_encoding['input_ids']))
  ]
  return batch_encoding

In [16]:
tokenized_datasets = imdb_dataset.map(
    tokenize_text, batched=True, remove_columns=["text", "label"]
)


Token indices sequence length is longer than the specified maximum sequence length for this model (720 > 512). Running this sequence through the model will result in indexing errors



In [20]:
tokenized_datasets,len(tokenized_datasets['train'][0]['word_ids']),len(tokenized_datasets['train'][0]['input_ids']),len(tokenized_datasets['train']['word_ids'])

(DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
}),
 363,
 363,
 25000)
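
The word_ids column holds, for every token, the index of the word it was split from, with None for special tokens such as [CLS] and [SEP]. This is handy whenever several sub-word tokens have to be treated as one word (whole-word masking is one example). The sketch below illustrates the idea on hand-written values; the tokens, the word_ids list, and the helper name group_token_positions_by_word are assumptions made for illustration, not output from this notebook's tokenizer.

```python
from collections import defaultdict

# Hypothetical tokens and word_ids, written by hand for illustration only;
# they are not copied from the tokenizer output in this notebook.
tokens = ["[CLS]", "un", "##believ", "##able", "film", "[SEP]"]
word_ids = [None, 0, 0, 0, 1, None]


def group_token_positions_by_word(word_ids):
    """Map each word index to the positions of the tokens that belong to it."""
    groups = defaultdict(list)
    for position, word_id in enumerate(word_ids):
        if word_id is not None:  # special tokens carry no word index
            groups[word_id].append(position)
    return dict(groups)


word_to_positions = group_token_positions_by_word(word_ids)
print(word_to_positions)  # {0: [1, 2, 3], 1: [4]}
```

Here the three sub-word tokens at positions 1 to 3 all belong to word 0, so they can be handled together.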

Next, concatenate all the tokens and split the result into chunks of a fixed context length. I use a context length of 128 tokens here to fit within the GPU memory available on Google Colab.

In [24]:
# Take 3 samples and print the number of tokens in each review:
sample = tokenized_datasets['train'][:3]
for idx,each_sample in enumerate(sample['input_ids']):
  print(f"review {idx} has {len(each_sample)} tokens")
sample.keys(),

review 0 has 363 tokens
review 1 has 304 tokens
review 2 has 133 tokens


(dict_keys(['input_ids', 'attention_mask', 'word_ids']),)

In [31]:
concatenated_example ={}
for key in sample.keys():
  concatenated_example[key] = sum(sample[key],[])


In [32]:
len(sum(sample['input_ids'],[])),len(concatenated_example['input_ids'])

(800, 800)
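
Flattening with sum(list_of_lists, []) works for three reviews, but it builds a new intermediate list at every step, so it may slow down on large batches. A possible alternative is itertools.chain.from_iterable, sketched below on a hypothetical toy_sample dictionary (the values are made up and only mimic the shape of a dataset slice).

```python
from itertools import chain

# Hypothetical stand-in for a slice of the tokenized dataset:
# each key maps to a list of per-review lists.
toy_sample = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1]],
}

# chain.from_iterable walks the inner lists once, whereas sum(lists, [])
# copies the accumulated list on every addition.
concatenated_toy = {
    key: list(chain.from_iterable(values)) for key, values in toy_sample.items()
}

# Both approaches give the same flattened list.
assert concatenated_toy["input_ids"] == sum(toy_sample["input_ids"], [])
print(len(concatenated_toy["input_ids"]))  # 7
```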

In [38]:
[_ for _ in range(0,20,5)][:3]

[0, 5, 10]

In [40]:
# Now list the start and end index of each 128-token chunk
chunked_sample ={}
chunk_size = 128
for i in range(0,len(concatenated_example['input_ids']),chunk_size):
  print(i,i+chunk_size)


0 128
128 256
256 384
384 512
512 640
640 768
768 896


In [45]:
# Slice every column into consecutive pieces of chunk_size tokens.
total_length = len(concatenated_example['input_ids'])
for k,v in concatenated_example.items():
  chunked_sample[k] = [v[i:i+chunk_size] for i in range(0,total_length,chunk_size)]

In [47]:
chunked_sample.keys(),len(chunked_sample['input_ids']),[len(_) for _ in chunked_sample['input_ids']]

(dict_keys(['input_ids', 'attention_mask', 'word_ids']),
 7,
 [128, 128, 128, 128, 128, 128, 32])

As we can see, the last chunk is smaller than the chunk size. There are two options: drop the last chunk or pad it to the chunk size. I will drop the last chunk here.

In [51]:
800/128,(800//128)*128,800//128

(6.25, 768, 6)
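
The arithmetic above suggests one way to drop the short last chunk: round the total length down to a multiple of the chunk size before slicing. The following is a minimal sketch under that assumption, run on made-up data; the name chunk_and_drop_remainder and the toy dictionary are hypothetical and are not the notebook's own function.

```python
def chunk_and_drop_remainder(concatenated, chunk_size):
    """Split every column into full chunks and discard the leftover tokens."""
    total_length = len(next(iter(concatenated.values())))
    # Round down to the largest multiple of chunk_size, e.g. 800 -> 768.
    usable_length = (total_length // chunk_size) * chunk_size
    return {
        key: [values[i:i + chunk_size] for i in range(0, usable_length, chunk_size)]
        for key, values in concatenated.items()
    }


# Hypothetical toy data with the same total length as the 3-review sample.
toy = {"input_ids": list(range(800)), "attention_mask": [1] * 800}
chunks = chunk_and_drop_remainder(toy, chunk_size=128)
print(len(chunks["input_ids"]), {len(c) for c in chunks["input_ids"]})  # 6 {128}
```

With 800 tokens and a chunk size of 128, the last 32 tokens are discarded and six full chunks remain.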

In [None]:
def concatenate_and_chunk(batch_of_data):
  concatenated_data = {}
  