**Task 2:**
Explore the distinctions between prefix language modelling (conditional language modelling) and
autoregressive language modelling and describe in short paragraphs.

Develop a prefix language model using Hugging Face and PyTorch.
You can pick any dataset for a creative text generation task and you should report the perplexity metric.


**Explore the distinctions between prefix language modelling (conditional language modelling) and autoregressive language modelling and describe in short paragraphs.**

**Prefix Language Modeling (Conditional Language Modeling):**
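Prefix language modeling generates a continuation conditioned on a provided context, the prefix. The prefix is treated as given input: the model can typically attend to all prefix tokens at once (bidirectionally), and the training loss is computed only on the tokens that follow the prefix. This suits tasks where the context is explicitly known, such as continuing a story from a prompt, because the model is trained to produce coherent continuations of that context and not to predict the context itself.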

**Autoregressive Language Modeling:**
Autoregressive language modeling predicts each token from the tokens that precede it. Generation proceeds left to right, one token at a time, with a causal attention mask that prevents any position from seeing later tokens, and the training loss covers every position in the sequence. This is the objective used by decoder-only models such as GPT-2 and is the standard choice for open-ended text generation.
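
Stated as formulas, the two objectives can be sketched as follows. The notation is an assumption introduced here for illustration: $s = (s_1, \dots, s_T)$ is a full sequence, $x$ is a prefix, and $y = (y_1, \dots, y_n)$ is its continuation. The autoregressive loss sums over every token of the sequence, while the prefix loss sums only over the continuation.

```latex
\begin{aligned}
\mathcal{L}_{\text{AR}}(s) &= -\sum_{t=1}^{T} \log p_\theta\left(s_t \mid s_{<t}\right) \\
\mathcal{L}_{\text{prefix}}(x, y) &= -\sum_{t=1}^{n} \log p_\theta\left(y_t \mid x,\, y_{<t}\right)
\end{aligned}
```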

**Distinctions:**
Both approaches generate output one token at a time; the distinction lies in how they treat the context. In prefix language modeling the context is an explicit input segment: it is encoded with full visibility and excluded from the loss. In autoregressive language modeling there is no separate input segment: every token is both a prediction target and context for the tokens after it. Prefix language modeling therefore fits tasks with a clear input and output, while autoregressive modeling is more general and also covers conditional tasks such as machine translation by placing the source text at the start of the sequence as a prompt. An autoregressive model used this way still reads the prompt with a causal mask, and during fine-tuning the loss on the prompt tokens has to be masked explicitly if only the continuation should be learned.
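
The difference in context handling shows up most concretely in the attention mask. The following is a minimal sketch in plain Python, not taken from the notebook; the helper names `causal_mask` and `prefix_lm_mask` are hypothetical, and real implementations build these masks as tensors.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def prefix_lm_mask(seq_len, prefix_len):
    """Prefix positions see the whole prefix; later positions stay causal."""
    return [
        [j <= i or j < prefix_len for j in range(seq_len)]
        for i in range(seq_len)
    ]


def show(mask):
    """Print a mask with 1 for 'may attend' and 0 for 'blocked'."""
    for row in mask:
        print(" ".join("1" if allowed else "0" for allowed in row))


show(causal_mask(4))
print()
show(prefix_lm_mask(4, prefix_len=2))
```

In the second mask the first two rows are identical, because both prefix positions see the full prefix, whereas in the causal mask the first position sees only itself.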






**Develop a prefix language model using Hugging Face and PyTorch.
You can pick any dataset for a creative text generation task and you should report the perplexity metric**

Installation of Required Packages

In [None]:
!pip install transformers==4.36.2
!pip install accelerate==0.25.0
!pip install datasets==2.15.0
!pip install peft==0.7.1
!pip install bitsandbytes==0.41.3
!pip install trl==0.7.7
!pip install tqdm==4.66.1
!pip install flash-attn==2.4.2

Collecting transformers==4.36.2
  Downloading transformers-4.36.2-py3-none-any.whl (8.2 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.35.2
    Uninstalling transformers-4.35.2:
      Successfully uninstalled transformers-4.35.2
Successfully installed transformers-4.36.2
Collecting accelerate==0.25.0
  Downloading accelerate-0.25.0-py3-none-any.whl (265 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.25.0
Collecting datasets==2.15.0
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
Collecting dill<0.3.8,>=

Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import os
from huggingface_hub import login
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    GPT2Tokenizer,
    GPT2LMHeadModel,
    Trainer,
    TrainingArguments,
)

In [None]:
import os
from huggingface_hub import login
# Log in to the Hugging Face Hub (prompts for an access token)
login()
# Directory where Hugging Face stores its cache and saved credentials
os.environ["HF_HOME"] = "/root/.huggingface"
# Credentials are deliberately not hard-coded here: login() above handles the
# Hugging Face token, and WANDB_API_KEY should come from the environment or Colab secrets.
# Weights & Biases run metadata
os.environ["WANDB_PROJECT"] = "Prefix language modelling"
os.environ["WANDB_NOTES"] = "Prefix language modelling using GPT2"
os.environ["WANDB_NAME"] = "Prefix Language Modeling"
# Model identifier used by the accelerate estimate-memory command below
os.environ["MODEL_NAME"] = "openai-community/gpt2"


Accelerate Memory Estimation
> This command estimates the memory requirements for the specified model using the Accelerate library.

In [None]:
!accelerate estimate-memory ${MODEL_NAME} --library_name transformers

Loading pretrained config for `openai-community/gpt2` from `transformers`...
┌────────────────────────────────────────────────────┐
│  Memory Usage for loading `openai-community/gpt2`  │
├───────┬─────────────┬──────────┬───────────────────┤
│ dtype │Largest Layer│Total Size│Training using Adam│
├───────┼─────────────┼──────────┼───────────────────┤
│float32│  147.24 MB  │ 476.2 MB │      1.86 GB      │
│float16│   73.62 MB  │ 238.1 MB │      952.4 MB     │
│  int8 │   36.81 MB  │119.05 MB │      476.2 MB     │
│  int4 │   18.4 MB   │ 59.53 MB │      238.1 MB     │
└───────┴─────────────┴──────────┴───────────────────┘
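
The last column of the table appears to follow a simple rule of thumb: training with Adam needs roughly four times the model size (weights, gradients, and two optimizer moment buffers), before counting activations. The sketch below, with a hypothetical helper name, checks the float32 and float16 rows under that assumption.

```python
def adam_training_estimate_mb(model_size_mb):
    """Rule-of-thumb estimate: weights + gradients + two Adam moment buffers."""
    return 4 * model_size_mb


# "Total Size" values copied from the table above, in MB
float32_size_mb = 476.2
float16_size_mb = 238.1

# float32 row: convert MB to GB with 1024 MB per GB
print(round(adam_training_estimate_mb(float32_size_mb) / 1024, 2), "GB")
# float16 row: stays in MB
print(round(adam_training_estimate_mb(float16_size_mb), 1), "MB")
```

The estimate excludes activation memory, which grows with batch size and sequence length, so actual usage during training will be higher.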


Importing Classes from Hugging Face Transformers

In [None]:
from transformers import (
    GPT2Tokenizer,
    DataCollatorForLanguageModeling,
    TextDataset,
    GPT2LMHeadModel,
    TrainingArguments,
    Trainer,
    pipeline)

Path to Data

In [None]:
train_path = '/content/train.txt'
test_path = '/content/test.txt'

Tokenizing the Data with the GPT-2 Tokenizer

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")


In [None]:
print('vocabulary size: %d, max sequence length: %d' % (tokenizer.vocab_size, tokenizer.model_max_length))
print('tokenize sequence "Once upon a time in a little village":', tokenizer('Once upon a time in a little village'))

vocabulary size: 50257, max sequence length: 1024
tokenize sequence "Once upon a time in a little village": {'input_ids': [7454, 2402, 257, 640, 287, 257, 1310, 7404], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}


Loading the Training and Test Datasets

In [None]:
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path=train_path,
    block_size=128)

test_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path=test_path,
    block_size=128)
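
To my understanding, `TextDataset` tokenizes the whole file and cuts the token stream into consecutive, non-overlapping blocks of `block_size` tokens, dropping a trailing remainder that does not fill a block. A rough pure-Python sketch of that chunking follows; the function name is hypothetical and special-token handling is ignored.

```python
def chunk_token_ids(token_ids, block_size):
    """Split a token stream into full, non-overlapping blocks.

    A trailing remainder shorter than block_size is discarded.
    """
    last_start = len(token_ids) - block_size
    return [
        token_ids[start:start + block_size]
        for start in range(0, last_start + 1, block_size)
    ]


# Toy stream of 10 token ids cut into blocks of 4: the last 2 ids are dropped
blocks = chunk_token_ids(list(range(10)), block_size=4)
print(blocks)
```

Because blocks are cut at fixed offsets, an example can start or end mid-sentence, which is why the decoded sample below begins partway through a sentence.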



In [None]:
print(tokenizer.decode(train_dataset[5]))

 time she would get along better with him, that she could overlook those problems. They kissed, and she knew that she liked Basha, but then Hastin interfered. She was so angry that she immediately said, once they were out of earshot of Basha, “You don’t mean anything to me anymore, Hastin
He heard Rhinna speak “The Queen wants you in her carriage.” Tom spoke “No, I’m not going in some asylum.” Ran was seen standing next to him spoke “It’s just for a private talk with you


Loading the GPT-2 Model (decoder-only)

In [None]:
model = GPT2LMHeadModel.from_pretrained('gpt2')


In [None]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
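
With `mlm=False`, the collator copies `input_ids` into `labels`, so the loss covers every token and the fine-tuning objective is plain causal language modeling. One way to move toward a prefix-style objective, sketched here as an assumption and not something this notebook does, is to replace the labels of the prefix tokens with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The helper name and the choice of a four-token prefix are illustrative.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_prefix_labels(input_ids, prefix_len):
    """Return labels that train only on the tokens after the prefix."""
    labels = list(input_ids)
    for position in range(min(prefix_len, len(labels))):
        labels[position] = IGNORE_INDEX
    return labels


# Token ids for "Once upon a time in a little village" from the tokenizer cell
input_ids = [7454, 2402, 257, 640, 287, 257, 1310, 7404]

# Treat "Once upon a time" (the first 4 tokens) as the prefix
labels = mask_prefix_labels(input_ids, prefix_len=4)
print(labels)
```

This changes only which positions contribute to the loss. GPT-2 would still read the prefix with a causal attention mask, so the result approximates a prefix language model without making the prefix bidirectional.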

Training the Model

In [None]:

training_args = TrainingArguments(
    output_dir = '/content/out', # the output directory for the model predictions and checkpoints
    overwrite_output_dir = True, # overwrite the content of the output directory
    per_device_train_batch_size = 32, # the batch size for training
    per_device_eval_batch_size = 32, # the batch size for evaluation
    learning_rate = 5e-5, # defaults to 5e-5
    num_train_epochs = 3, # total number of training epochs to perform
)

trainer = Trainer(
    model = model,
    args = training_args,
    data_collator=data_collator,
    train_dataset = train_dataset,
    eval_dataset = test_dataset
)

In [None]:
trainer.train()



TrainOutput(global_step=6, training_loss=3.869741121927897, metrics={'train_runtime': 5.1811, 'train_samples_per_second': 21.424, 'train_steps_per_second': 1.158, 'total_flos': 7250853888000.0, 'train_loss': 3.869741121927897, 'epoch': 3.0})
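
The step count in this output can be sanity-checked with a little arithmetic. Assuming a single device and no gradient accumulation (neither is shown in the output), the number of optimizer steps is the number of batches per epoch times the number of epochs, and the reported throughput multiplied by the runtime gives the number of examples processed. The helper name below is hypothetical.

```python
import math


def total_optimizer_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a run that keeps the final partial batch."""
    return math.ceil(num_examples / batch_size) * epochs


# Values copied from the TrainOutput above
samples_per_second = 21.424
runtime_seconds = 5.1811
epochs = 3

# Examples processed over all epochs, then the implied number of blocks per epoch
examples_seen = samples_per_second * runtime_seconds
examples_per_epoch = round(examples_seen / epochs)

print(examples_per_epoch)
print(total_optimizer_steps(examples_per_epoch, batch_size=32, epochs=epochs))
```

Under these assumptions the training set holds only about 37 blocks of 128 tokens, which matches the reported `global_step=6` and is worth keeping in mind when interpreting the evaluation results.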

Evaluation of the Model

In [None]:
# Returns a dict of metrics; 'eval_loss' is the mean cross-entropy on the test set
results = trainer.evaluate()

In [None]:
trainer.save_model()

Creating a Text-Generation Pipeline

In [None]:
generator = pipeline('text-generation', tokenizer='gpt2', model='/content/out')

In [None]:
print(generator('Once upon a time', max_length=40)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time I discovered the existence of what was known to be the Golden Dawn, a group of people who had begun organizing in secret for the end of capitalism. However, something was very wrong


Using Beam Search

In [None]:
text_beam = generator('Once upon a time',
                      max_length=40,
                      num_beams=5)
print(text_beam[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time, it seemed as if he had come to an end.

"I don't know what to do," he said. "I don't know what to do."



Using Random Sampling with Temperature

In [None]:
text_random_sampling = generator('I like',
                                 max_length=40,
                                 top_k=0,
                                 do_sample=True,
                                 temperature=0.7)
print(text_random_sampling[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


I like to play with these. I like to use them without running into each other. I'm hoping that some of these guys will play better under their new coaches.

"I know we


Using Top-K Sampling

In [None]:
text_k_sampling = generator('Once upon a time',
                            max_length=40,
                            top_k=40,
                            do_sample=True)
print(text_k_sampling[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time the sun was shining with its rays against the surface of the moon. He and Rufus watched their surroundings before the giant figure took off on a huge jagged trail.
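
Top-k sampling keeps only the `k` most probable next tokens, renormalizes their probabilities, and samples from that reduced set. The sketch below illustrates the filtering step on a toy distribution in plain Python; the function name is hypothetical, and the library's own implementation works on logits and may differ in detail.

```python
def top_k_filter(probs, k):
    """Keep the k most probable token indices and renormalize their probabilities."""
    ranked = sorted(range(len(probs)), key=lambda index: probs[index], reverse=True)
    kept = ranked[:k]
    total = sum(probs[index] for index in kept)
    return {index: probs[index] / total for index in kept}


# Toy next-token distribution over a 4-token vocabulary
next_token_probs = [0.5, 0.3, 0.15, 0.05]

# With k=2 only tokens 0 and 1 survive, rescaled so they sum to 1
filtered = top_k_filter(next_token_probs, k=2)
print(filtered)
```

With `top_k=40`, as in the cell above, the same idea applies to the 40 most probable entries of GPT-2's 50257-token vocabulary.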



Using Top-P (Nucleus) Sampling

In [None]:
text_p_sampling = generator('Please',
                            max_length=40,
                            top_k=0,
                            top_p=0.92,
                            do_sample=True)
print(text_p_sampling[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Please forward these as a share of your load, for information about ability points and to give anyone who might have questions about it to the community with which you live.

Moving on!
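
Top-p (nucleus) sampling keeps the smallest set of most probable tokens whose cumulative probability reaches `top_p`, so the number of candidates adapts to how peaked the distribution is. The following plain-Python sketch shows that selection on a toy distribution; the function name is hypothetical, and the library's implementation operates on logits and may treat boundary cases differently.

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    ranked = sorted(range(len(probs)), key=lambda index: probs[index], reverse=True)
    kept = []
    cumulative = 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= p:
            break
    total = sum(probs[index] for index in kept)
    return {index: probs[index] / total for index in kept}


# Toy next-token distribution over a 4-token vocabulary
next_token_probs = [0.5, 0.3, 0.15, 0.05]

# With p=0.92 the three most probable tokens are needed to reach the threshold
nucleus = top_p_filter(next_token_probs, p=0.92)
print(sorted(nucleus))
```

Setting `top_k=0` in the cell above disables top-k filtering, so only the nucleus threshold restricts the candidates.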




Perplexity Metric Calculation

In [None]:
import numpy as np
def perplexity(eval_loss):
    # Perplexity is the exponential of the mean per-token cross-entropy loss (in nats)
    return np.exp(eval_loss)

In [None]:
perplexity(results['eval_loss'])

44.05392524743887
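
To make the metric more concrete, the sketch below computes perplexity directly from per-token negative log-likelihoods, which is what the evaluation loss averages. It is an illustration in plain Python with a hypothetical function name and toy inputs, not part of the notebook's pipeline.

```python
import math


def perplexity_from_token_nlls(token_nlls):
    """Exponentiate the mean negative log-likelihood (in nats) over all tokens."""
    mean_nll = sum(token_nlls) / len(token_nlls)
    return math.exp(mean_nll)


# A toy model that assigns probability 1/4 to every observed token
uniform_nlls = [-math.log(0.25)] * 8
print(perplexity_from_token_nlls(uniform_nlls))

# Inverting the formula recovers the evaluation loss behind the value reported above
print(math.log(44.05392524743887))
```

A perplexity of about 44 means the fine-tuned model is, on average, as uncertain as if it were choosing uniformly among roughly 44 tokens at each position of the test text.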