# Keyword Extraction with the RAKE Algorithm

We extract keywords with RAKE (Rapid Automatic Keyword Extraction) using the `rake_nltk` library. We create a single `Rake` instance and wrap it in a helper, `get_keywords`, which returns the key phrases of a text ranked from highest to lowest score:


In [None]:
from rake_nltk import Rake
rake = Rake()

def get_keywords(text):
    rake.extract_keywords_from_text(text)
    return rake.get_ranked_phrases()
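
To make the ranking more concrete, here is a simplified, dependency-free sketch of the scoring idea behind RAKE: text is split into candidate phrases at stopwords and punctuation, each word is scored by its degree divided by its frequency, and a phrase scores the sum of its word scores. This is not the `rake_nltk` implementation. The tiny `STOPWORDS` set and the helpers `candidate_phrases` and `rake_scores` are hypothetical names introduced only for illustration, and `rake_nltk` tokenizes text and selects stopwords differently.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption); rake_nltk uses NLTK's full list.
STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in", "on", "as", "y", "u", "no"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    # Punctuation always ends a phrase, so cut the text into fragments first.
    for fragment in re.split(r"[^\w\s']+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text):
    """Return (phrase, score) pairs sorted from highest to lowest score."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # summed length of the phrases a word occurs in
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


ranked = rake_scores("universal remote y u no work on universe")
print(ranked)
```

Longer phrases made of words that rarely appear alone score highest, which is why multi-word phrases tend to lead the ranking.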

# Data Preparation: Splitting Data into Train and Validation Sets

To organize the data for training and validation, we iterate over the DataFrame `df` and collect `(template, caption)` pairs into `train_data` or `val_data`, depending on whether the row's `data_type` column is `'train'` or `'val'`. The template serves as the label.


In [None]:
# Create lists to store captions and labels
train_data = []
val_data = []

# Iterate over the DataFrame
for index, row in df.iterrows():
    template = row['Template']
    caption = row['Caption']
    data_type = row['data_type']
    
    # Append the template and caption data to the appropriate list based on data_type
    if data_type == 'train':
        train_data.append((template, caption))
    elif data_type == 'val':
        val_data.append((template, caption))

# Print the length of the train and val data
print("Train data length:", len(train_data), train_data[:5])
print("Val data length:", len(val_data), val_data[:5]) 

Train data length: 719455 [('Y U No', 'commercial <sep> y u no same volume as show!?'), ('Y U No', 'Victoria <sep> y u no tell us your secret?!'), ('Y U No', 'KONY <sep> Y u no take justin bieber'), ('Y U No', 'TED <sep> y u no tell us how you met their mother'), ('Y U No', 'universal remote <sep> y u no work on universe?')]
Val data length: 179864 [('Y U No', 'Google <sep> Y U NO LET ME FINISH TYPING?'), ('Y U No', 'i held the door <sep> y u no say thank you'), ('Y U No', 'Team rocket <sep> y u no catch a different pikachu?'), ('Y U No', 'Y u no guy <sep> y u sound asian in my head?'), ('Y U No', 'hands <sep> y u no have same amount of fingers?')]


In [None]:
# Data inspection 
train_captions = []
train_classes = []
train_keywords = []

val_captions = []
val_classes = []
val_keywords = []

# Iterate over the train_data list
for template, caption in train_data:
    train_captions.append(caption)
    train_classes.append(template)
    kw = get_keywords(caption)
    train_keywords.append(kw)

# Iterate over the val_data list
for template, caption in val_data:
    val_captions.append(caption)
    val_classes.append(template)
    kw = get_keywords(caption)
    val_keywords.append(kw)

# Print the length of the train and val data
print("Train data length:", len(train_captions), len(train_classes), len(train_keywords))
print("Val data length:", len(val_captions), len(val_classes), len(val_keywords))


all_captions = train_captions + val_captions
largest_word_count = 0
longest_strings = []

# Collect every caption that ties for the highest whitespace-separated word count
for string in all_captions:
    word_count = len(string.split())
    if word_count > largest_word_count:
        largest_word_count = word_count
        longest_strings = [string]
    elif word_count == largest_word_count:
        longest_strings.append(string)

# Sort the tied captions by character length, longest first
longest_strings.sort(key=len, reverse=True)

# Print up to 10 of the captions that share the largest word count
print(f"The top 10 strings with the largest number of words have {largest_word_count} words each:")
for i, string in enumerate(longest_strings[:10], start=1):
    print(f"{i}. {string}")

Train data length: 719455 719455 719455
Val data length: 179864 179864 179864
The top 10 strings with the largest number of words have 266 words each:
1. cuz bryan can'T go so he dont need to be in this convo <sep> good
Grumpy cat good	0	Blake's moving away? <sep> Good.
Grumpy cat good	0	Did you forget To pack something? <sep> Good.
Grumpy cat good	0	social media? <sep> No.
Grumpy cat good	0	maisy got hit by a car? <sep> good
Grumpy cat good	0	clayton has down syndrome? <sep> good
Grumpy cat good	0	you lost your job? <sep> good
Grumpy cat good	0	you can't find a job number? <sep> good
Grumpy cat good	0	your hours are cut? <sep> good
Grumpy cat good	0	everyone hates your squeaky shoes? <sep> good
Grumpy cat good	0	you have a goitre? <sep> good
Grumpy cat good	0	Wes left his wallet at home? <sep> good
Grumpy cat good	0	bubbler for lunch wes? <sep> good
Grumpy cat good	0	you hate tom waits? <sep> good
Grumpy cat good	0	OU LOST? <sep> GOOD
Grumpy cat good	0	YOu didn't get a deal? <sep> Goo
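
The longest "caption" printed above spans many lines and contains tab-separated fields, which suggests that several rows of the source file were merged into a single record. Below is a minimal sketch for flagging such records, assuming that a merged record always contains a newline or a tab character. `find_merged_rows` is a hypothetical helper, and the sample strings are made up for illustration.

```python
def find_merged_rows(captions):
    """Return captions containing a newline or tab, which one caption should not."""
    return [c for c in captions if "\n" in c or "\t" in c]


sample = [
    "Google <sep> Y U NO LET ME FINISH TYPING?",
    "first caption <sep> good\nGrumpy cat good\t0\tsecond caption <sep> good",
]

suspect = find_merged_rows(sample)
print(f"{len(suspect)} of {len(sample)} captions look like merged rows")
```

Running the same check on `all_captions` would show how many records are affected before deciding whether to drop or re-split them.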

# Special Tokens and Unfreezing Strategy for Text Encoding

In this section, we define the special tokens used to format training sequences, choose how many layers to unfreeze during fine-tuning, and set the remaining hyperparameters.

We declare a dictionary named `SPECIAL_TOKENS` containing special tokens used for encoding input captions. These tokens include:
- `bos_token`: Beginning of sequence token
- `eos_token`: End of sequence token
- `unk_token`: Unknown token
- `pad_token`: Padding token
- `sep_token`: Separator token

These tokens mark where a sequence starts and ends and where its parts are separated. The model needs them to learn conditional generation and to know when to stop decoding.

Additionally, we set `UNFREEZE_LAST_N` to `6`, the number of final transformer blocks that remain trainable during fine-tuning. Training only these blocks uses less memory and compute than updating the whole model, at the cost of leaving the earlier blocks at their pretrained weights.

The same cell also sets the model size, the maximum sequence length, the batch sizes, and the optimizer hyperparameters used later in training.


In [None]:
DEBUG           = False

INPUT_DIR       = 'articles'

USE_APEX        = True
APEX_OPT_LEVEL  = 'O1'

MODEL           = 'gpt2' #{gpt2, gpt2-medium, gpt2-large, gpt2-xl}

UNFREEZE_LAST_N = 6 #The last N layers to unfreeze for training

SPECIAL_TOKENS  = { "bos_token": "<|BOS|>",
                    "eos_token": "<|EOS|>",
                    "unk_token": "<|UNK|>",
                    "pad_token": "<|PAD|>",
                    "sep_token": "<|SEP|>"}

MAX_LENGTH      = 128  # maximum tokens per example (GPT-2's context limit is 1024)

TRAIN_SIZE      = 0.8

if USE_APEX:
    TRAIN_BATCHSIZE = 4
    BATCH_UPDATE    = 16
else:
    TRAIN_BATCHSIZE = 2
    BATCH_UPDATE    = 32

EPOCHS          = 4
LR              = 5e-4
EPS             = 1e-8
WARMUP_STEPS    = 100  # integer number of warmup steps

SEED            = 2020
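
Both branches of the `USE_APEX` switch lead to the same effective batch size once `BATCH_UPDATE` is used as the number of gradient-accumulation steps, as it is in the training arguments later in this notebook. The sketch below works through that arithmetic. The helper names are hypothetical, a single device is assumed, and the step estimate is only approximate because the exact count depends on how the training library rounds the last partial batch.

```python
import math


def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of examples that contribute to one optimizer update."""
    return per_device_batch * accumulation_steps * num_devices


def estimated_optimizer_steps(num_examples, per_device_batch, accumulation_steps, epochs):
    """Rough number of optimizer updates over the whole run (single device)."""
    per_epoch = math.ceil(num_examples / effective_batch_size(per_device_batch, accumulation_steps))
    return per_epoch * epochs


with_apex = effective_batch_size(4, 16)     # TRAIN_BATCHSIZE=4, BATCH_UPDATE=16
without_apex = effective_batch_size(2, 32)  # TRAIN_BATCHSIZE=2, BATCH_UPDATE=32
print(with_apex, without_apex)

# Hypothetical dataset of 1,000 examples trained for 4 epochs
print(estimated_optimizer_steps(1000, 4, 16, 4))
```

Keeping the product constant means the optimizer sees the same number of examples per update in both settings; only the memory needed per forward pass changes.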

# Custom Caption Dataset for Text Processing

To facilitate data loading and preprocessing for caption-based tasks, we implement a custom dataset class `CaptionDataset` inheriting from `torch.utils.data.Dataset`.

The `CaptionDataset` class is designed to handle input data in the form of tuples containing template, caption, and keywords. Here's a breakdown of its key functionalities:

- **Initialization (`__init__`):**
  - Initializes the dataset by extracting template, caption, and keywords from the input `data`.
  - Sets attributes such as `randomize` (for randomization during keyword processing) and `tokenizer` (for text tokenization).
  
- **Static Method (`join_keywords`):**
  - Joins the keywords into a single comma-separated string. If `randomize` is `True`, it first keeps a random-length prefix of the keyword list (possibly empty) and shuffles it.
  
- **Length Method (`__len__`):**
  - Returns the total number of caption samples in the dataset.

- **Get Item Method (`__getitem__`):**
  - Retrieves sample `i` from the dataset.
  - Builds one input string in the order beginning-of-sequence token, template, separator, keywords, separator, caption, end-of-sequence token, using `SPECIAL_TOKENS`.
  - Tokenizes the string with `tokenizer`, truncating or padding it to `MAX_LENGTH` tokens.
  - Returns a dictionary containing:
    - `"label"`: A copy of `input_ids` that serves as the language-modeling target.
    - `"input_ids"`: Tensor of token IDs for the input sequence.
    - `"attention_mask"`: Tensor marking real tokens with 1 and padding with 0.

This `CaptionDataset` class encapsulates the preprocessing pipeline, so it can be passed directly to PyTorch's data loaders.


In [None]:
# Custom caption dataset 
import random

import torch
from torch.utils.data import Dataset

class CaptionDataset(Dataset):
    def __init__(self, data, tokenizer, randomize=False):
        template, caption, keywords  = [], [], []
        for temp, cap, kws in data:
            template.append(temp)
            caption.append(cap)
            keywords.append(kws)
        
        self.randomize = randomize    
        self.tokenizer = tokenizer
        self.template = template
        self.keywords = keywords
        self.caption = caption
    
    @staticmethod
    def join_keywords(keywords, randomize=False):
        N = len(keywords)

        #random sampling and shuffle
        if randomize: 
            M = random.choice(range(N+1))
            keywords = keywords[:M]
            random.shuffle(keywords)

        return ','.join(keywords)
        
    def __len__(self):
        return len(self.caption)

    def __getitem__(self, i):
        keywords = self.keywords[i].copy()
        # Pass the dataset's randomize flag so keyword sampling actually takes effect
        kw = self.join_keywords(keywords, self.randomize)
        # Layout: <BOS> template <SEP> keywords <SEP> caption <EOS>
        text = SPECIAL_TOKENS['bos_token'] + self.template[i] + \
               SPECIAL_TOKENS['sep_token'] + kw + SPECIAL_TOKENS['sep_token'] + \
               self.caption[i] + SPECIAL_TOKENS['eos_token']
        
        encodings_dict = self.tokenizer(text,
                                   truncation = True,
                                   max_length = MAX_LENGTH,
                                   padding="max_length")
        
        input_ids = encodings_dict['input_ids']
        attention_mask = encodings_dict['attention_mask']
        
        return {
            "label": torch.tensor(input_ids),
            "input_ids": torch.tensor(input_ids),            
            "attention_mask": torch.tensor(attention_mask),
        }
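
The string layout built in `__getitem__` and the keyword sampling in `join_keywords` can be tried out without a tokenizer. The sketch below is a standalone illustration, not part of the dataset class: `build_sequence` and `sample_keywords` are hypothetical helpers, and the keyword list is made up rather than taken from actual RAKE output.

```python
import random

SPECIAL_TOKENS = {"bos_token": "<|BOS|>",
                  "eos_token": "<|EOS|>",
                  "unk_token": "<|UNK|>",
                  "pad_token": "<|PAD|>",
                  "sep_token": "<|SEP|>"}


def sample_keywords(keywords, rng):
    """Keep a random-length prefix of the keywords (possibly empty) and shuffle it."""
    m = rng.randint(0, len(keywords))
    kept = list(keywords[:m])
    rng.shuffle(kept)
    return kept


def build_sequence(template, keywords, caption):
    """Assemble <BOS> template <SEP> keywords <SEP> caption <EOS> as one string."""
    return (SPECIAL_TOKENS["bos_token"] + template
            + SPECIAL_TOKENS["sep_token"] + ",".join(keywords)
            + SPECIAL_TOKENS["sep_token"] + caption
            + SPECIAL_TOKENS["eos_token"])


example = build_sequence(
    "Y U No",
    ["universal remote", "work"],  # illustrative keywords
    "universal remote <sep> y u no work on universe?",
)
print(example)

# With sampling, the model sees a varying subset of the keywords
kept = sample_keywords(["universal remote", "work", "universe"], random.Random(0))
print(build_sequence("Y U No", kept, "universal remote <sep> y u no work on universe?"))
```

At generation time, the same layout can be cut off after the second separator so that the model continues with a caption.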

# GPT-2 Language Model and Tokenizer Initialization

To initialize a GPT-2 language model and tokenizer for text generation tasks, we define two utility functions: `get_tokenizer` and `get_model`.

## `get_tokenizer` Function

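The `get_tokenizer` function loads the GPT-2 tokenizer (`GPT2TokenizerFast`) from the `"openai-community/gpt2"` checkpoint and, if a dictionary of special tokens is provided, registers those tokens. The `get_model` function, defined in the same cell, loads the GPT-2 language model with matching token IDs and resizes its embedding matrix when special tokens have been added.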

In [None]:
from transformers import GPT2LMHeadModel, GPT2TokenizerFast, GPT2Config

def get_tokenizer(special_tokens=None):
    tokenizer = GPT2TokenizerFast.from_pretrained("openai-community/gpt2") #GPT2Tokenizer
    if special_tokens:
        tokenizer.add_special_tokens(special_tokens)
        print("Special tokens added")
    return tokenizer

def get_model(tokenizer, special_tokens=None, load_model_path=None):

    #GPT2LMHeadModel
    if special_tokens:
        config = GPT2Config.from_pretrained("gpt2", 
                                            bos_token_id=tokenizer.bos_token_id,
                                            eos_token_id=tokenizer.eos_token_id,
                                            sep_token_id=tokenizer.sep_token_id,
                                            pad_token_id=tokenizer.pad_token_id,
                                            output_hidden_states=False)
    else: 
        config = GPT2Config.from_pretrained("gpt2",                                     
                                            pad_token_id=tokenizer.eos_token_id,
                                            output_hidden_states=False)    

    #----------------------------------------------------------------#
    model = GPT2LMHeadModel.from_pretrained("gpt2", config=config)

    if special_tokens:
        #Special tokens added, model needs to be resized accordingly
        model.resize_token_embeddings(len(tokenizer))

    if load_model_path:
        # from_pretrained is a classmethod that returns a new model, so keep its result
        model = GPT2LMHeadModel.from_pretrained(load_model_path)

    model.cuda()
    return model

In [None]:
tokenizer = get_tokenizer(special_tokens=SPECIAL_TOKENS)
model = get_model(tokenizer, 
                  special_tokens=SPECIAL_TOKENS)

Special tokens added
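
Adding special tokens enlarges the vocabulary, so the model's embedding matrix needs one extra row per new token; that is the purpose of the `resize_token_embeddings` call. The toy sketch below illustrates the idea on a plain list of lists. It is not how the library implements resizing: `resize_embedding_table` is a hypothetical helper, and the new rows are filled with zeros purely for illustration. The vocabulary arithmetic assumes that none of the five special-token strings already exists in the pretrained GPT-2 vocabulary of 50,257 entries.

```python
def resize_embedding_table(table, new_size, dim):
    """Return a table with new_size rows: existing rows are kept, new rows appended."""
    if new_size <= len(table):
        return table[:new_size]
    extra = [[0.0] * dim for _ in range(new_size - len(table))]
    return table + extra


# Toy "pretrained" table: 4 tokens, 3 dimensions each
toy_table = [[0.1, 0.2, 0.3],
             [0.4, 0.5, 0.6],
             [0.7, 0.8, 0.9],
             [1.0, 1.1, 1.2]]

resized = resize_embedding_table(toy_table, new_size=6, dim=3)
print(len(resized), resized[:4] == toy_table)

GPT2_VOCAB_SIZE = 50257   # pretrained GPT-2 vocabulary size
NUM_SPECIAL_TOKENS = 5    # BOS, EOS, UNK, PAD, SEP
print(GPT2_VOCAB_SIZE + NUM_SPECIAL_TOKENS)
```

Because the new rows carry no pretrained information, the embeddings of the special tokens have to be learned during fine-tuning.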


# Freezing and Unfreezing Model Layers for Fine-Tuning

The cell below freezes the whole GPT-2 model and then re-enables gradients for selected parts.

- **Freezing All Parameters:**
  - Initially, all model parameters are set to `requires_grad = False`, effectively freezing the entire model.

- **Unfreezing Last n Transformer Blocks:**
  - The code unfreezes the last `n` transformer blocks (`n = 6`, matching `UNFREEZE_LAST_N`) by setting their parameters to `requires_grad = True`.

- **Unfreezing Specific Layers:**
  - The final layer normalization (`ln_f`) and the language model head (`lm_head`) are also unfrozen. In GPT-2 the language model head shares its weight matrix with the token embeddings, so unfreezing `lm_head` makes the embeddings trainable as well, which lets the model learn the newly added special tokens.

Only the unfrozen parameters receive gradient updates. This reduces the memory and compute needed per training step while still letting the upper layers adapt to the new task.


In [None]:
for parameter in model.parameters():
    parameter.requires_grad = False

num_blocks = len(model.transformer.h)
for i, m in enumerate(model.transformer.h):        
    #Only un-freeze the last UNFREEZE_LAST_N transformer blocks
    if i >= num_blocks - UNFREEZE_LAST_N:
        for parameter in m.parameters():
            parameter.requires_grad = True 

for parameter in model.transformer.ln_f.parameters():        
    parameter.requires_grad = True

for parameter in model.lm_head.parameters():        
    parameter.requires_grad = True
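
The block-selection rule can be checked without loading a model. The sketch below lists which block indices stay trainable for a given depth; `trainable_block_indices` is a hypothetical helper. It also shows why the threshold should depend on the number of blocks: an index threshold of 6 selects the last six blocks only in a 12-block model such as the base `gpt2`.

```python
def trainable_block_indices(num_blocks, unfreeze_last_n):
    """Indices of the transformer blocks that remain trainable."""
    first = max(0, num_blocks - unfreeze_last_n)
    return list(range(first, num_blocks))


# Base GPT-2 has 12 transformer blocks
print(trainable_block_indices(12, 6))

# gpt2-medium has 24 blocks, so the last six start at index 18
print(trainable_block_indices(24, 6))
```

Deriving the threshold from the model depth keeps the behaviour consistent when `MODEL` is switched to a larger variant.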

# Data Loading with DataLoader

The cell below wraps the training and validation data in `CaptionDataset` objects and builds a `DataLoader` for each using `torch.utils.data`.

- **Training DataLoader (`train_dataloader`):**
  - Uses `RandomSampler` to draw training examples in random order.
  - Batches examples with a batch size of `BATCH_UPDATE`.

- **Validation DataLoader (`val_dataloader`):**
  - Also uses `RandomSampler`, so validation examples are visited in random order as well.
  - Uses the same batch size, `BATCH_UPDATE`.

These loaders are useful for iterating over batches manually. The Hugging Face `Trainer` used in the next section receives the datasets directly and builds its own data loaders from its training arguments.


In [None]:
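from torch.utils.data import DataLoader, RandomSampler

# CaptionDataset expects (template, caption, keywords) triples, so combine the
# parallel lists built during data inspection
train_dataset = CaptionDataset(list(zip(train_classes, train_captions, train_keywords)), tokenizer=tokenizer)
val_dataset = CaptionDataset(list(zip(val_classes, val_captions, val_keywords)), tokenizer=tokenizer)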

train_dataloader = DataLoader(
    train_dataset,
    sampler=RandomSampler(train_dataset),
    batch_size=BATCH_UPDATE)

val_dataloader = DataLoader(
    val_dataset,
    sampler=RandomSampler(val_dataset),
    batch_size=BATCH_UPDATE)
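
Conceptually, a random sampler combined with a batch size shuffles the example indices once per pass and then cuts them into consecutive chunks. The dependency-free sketch below mimics that behaviour. It is an illustration, not PyTorch's implementation, and `random_batches` is a hypothetical helper.

```python
import random


def random_batches(num_examples, batch_size, seed=None):
    """Shuffle example indices and split them into batches; the last may be smaller."""
    rng = random.Random(seed)
    indices = list(range(num_examples))
    rng.shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, num_examples, batch_size)]


batches = random_batches(num_examples=10, batch_size=4, seed=0)
for batch in batches:
    print(batch)
```

Every index appears exactly once per pass, and the final batch holds the leftover examples when the dataset size is not a multiple of the batch size.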

# Training with Hugging Face Trainer Class

The cell below fine-tunes the model with the Hugging Face `Trainer` class.

- **Training Arguments (`training_args`):**
  - Sets the output directory, number of epochs (`EPOCHS`), per-device batch size (`TRAIN_BATCHSIZE`), gradient accumulation steps (`BATCH_UPDATE`), per-epoch evaluation and checkpointing, mixed-precision training (`fp16`) with Apex optimization level `APEX_OPT_LEVEL`, warmup steps (`WARMUP_STEPS`), learning rate (`LR`), Adam epsilon (`EPS`), and weight decay.

- **Trainer Initialization (`trainer`):**
  - Initializes the `Trainer` object with the specified model, training arguments (`args`), training dataset (`train_dataset`), evaluation dataset (`val_dataset`), and tokenizer (`tokenizer`).

- **Training Execution (`trainer.train()`):**
  - Executes the training process using the configured `Trainer` object (`trainer`), which includes model training based on the provided datasets and training arguments.

- **Model Saving (`trainer.save_model()`):**
  - Saves the trained model to the specified directory (`'/content/final_model_2'`) after training completion.

The `Trainer` class handles the training loop, evaluation, and checkpointing, so the notebook only has to supply the model, the datasets, and the training arguments.


In [None]:
%%time
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="/content/",
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=TRAIN_BATCHSIZE,
    per_device_eval_batch_size=TRAIN_BATCHSIZE,
    gradient_accumulation_steps=BATCH_UPDATE,
    evaluation_strategy="epoch",
    save_strategy = "epoch",
    fp16=True,
    fp16_opt_level=APEX_OPT_LEVEL,
    warmup_steps=WARMUP_STEPS,
    learning_rate=LR,
    adam_epsilon=EPS,
    weight_decay=0.01,
    save_total_limit=1,
    load_best_model_at_end=True,
)

#---------------------------------------------------#
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer
)

#---------------------------------------------------#
trainer.train()
trainer.save_model('/content/final_model_2')

  3%|▎         | 26/892 [00:45<24:30,  1.70s/it]

KeyboardInterrupt: 
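
The training arguments combine `learning_rate` with `warmup_steps`. With a linear schedule, which is the `Trainer` default, the learning rate rises linearly from zero during warmup and then decays linearly to zero by the final step. The sketch below reproduces that shape as a plain function. `linear_schedule_lr` is a hypothetical helper, and the total of 892 steps is taken from the progress bar above.

```python
def linear_schedule_lr(step, total_steps, warmup_steps, base_lr):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 892   # from the progress bar
WARMUP = 100        # WARMUP_STEPS
BASE_LR = 5e-4      # LR

for step in (0, 50, 100, 496, 892):
    print(step, linear_schedule_lr(step, TOTAL_STEPS, WARMUP, BASE_LR))
```

The run shown above was interrupted at step 26, well inside the warmup phase, so the learning rate had reached only about a quarter of `LR` at that point.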