# ZAF202305_NLP_NextWordSuggestion

### Overview
Next-word suggestion is framed here as a masked language modelling (MLM) task in NLP. The model receives an input containing a mask token:

> **example**: Hope you have a \[mask] day!     

The model must predict a probability for each candidate word that could fill the mask.

> **example**: Hope you have a nice day!
>
> Hope you have a great day!
>
> Hope you have an amazing day!
>
> nice - 0.80, great - 0.15 and amazing - 0.05.
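
To illustrate how scores like these arise, here is a minimal sketch that turns raw model scores (logits) for candidate words into probabilities with a softmax. The logit values and the `softmax` helper are made up for illustration; in practice the model scores every token in its vocabulary, not just three.

```python
import math

def softmax(logits):
    """Convert a dict of {word: logit} into {word: probability}."""
    # Subtract the largest logit first for numerical stability.
    peak = max(logits.values())
    exps = {word: math.exp(score - peak) for word, score in logits.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}

# Hypothetical logits for the masked position in "Hope you have a [MASK] day!"
candidate_logits = {"nice": 3.0, "great": 1.3, "amazing": 0.2}
probabilities = softmax(candidate_logits)

# Print candidates from most to least likely.
for word, prob in sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: {prob:.2f}")
```

The probabilities always sum to 1, which is why the three scores in the example are expected to add up to at most 1.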

In this use case, we experiment with the following pre-trained models:
1. DistilBERT
2. DistilRoBERTa
3. ELECTRA

We fine-tune each model, a form of transfer learning, on the datasets listed in the Data section below.
Let's get started!

### Methodology
We are building a web application that exposes a machine learning model through an API, without revealing the source code or the details of how it works. It helps you type faster on a laptop by offering high-quality next-word suggestions based on what has been typed so far.
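
As a rough sketch of the suggestion step behind such an API, the typed text can be extended with a mask token and passed to a fill-mask model. The names `build_masked_prompt`, `suggest_next_words`, and the `fake_fill_mask` stand-in are hypothetical; a real deployment would pass a fill-mask pipeline in place of the stand-in.

```python
def build_masked_prompt(typed_text, mask_token="[MASK]"):
    """Append a mask token to the text typed so far."""
    return f"{typed_text.rstrip()} {mask_token}"

def suggest_next_words(typed_text, fill_mask, top_k=3):
    """Return the top_k candidate words, best first."""
    predictions = fill_mask(build_masked_prompt(typed_text))
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p["token_str"] for p in ranked[:top_k]]

def fake_fill_mask(prompt):
    """Stand-in for a real model: returns fixed, made-up predictions."""
    return [
        {"token_str": "great", "score": 0.15},
        {"token_str": "nice", "score": 0.80},
        {"token_str": "amazing", "score": 0.05},
    ]

print(build_masked_prompt("Hope you have a"))
print(suggest_next_words("Hope you have a", fake_fill_mask, top_k=2))
```

Passing the model in as a callable keeps the suggestion logic independent of any particular library, so it can be exercised without loading a model.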

### Business Segments
This use case can be utilized by the following business segments:
- Lifestyle
- Social Media

### Data
1. SuperGLUE - [Link](https://huggingface.co/datasets/super_glue)
2. Empathetic_Dialogues - [Link](https://huggingface.co/datasets/empathetic_dialogues)

### Papers
1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - [Link](https://arxiv.org/abs/1810.04805)

### Demo Link

### Team
Name - Github
- Shubham - shubhamwankar
- Jay Sanghavi - Jay-Sanghavi
- Mohd Sadiq - sadiisays
- Mayank Mangal Mourya - mayankmangalmourya
- Priyanka - priyanka011011
- Sanjeeb - sanjeebtiwary

## 1. Load Dataset

In [23]:
!pip install datasets transformers accelerate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [24]:
import pandas as pd
import numpy as np
from datasets import load_dataset
import warnings
from transformers import AutoModelForMaskedLM
from transformers import AutoTokenizer
from transformers import TrainingArguments
from transformers import Trainer
from torch.utils.data import DataLoader
from transformers import DataCollatorForLanguageModeling, default_data_collator
from torch.optim import AdamW
from accelerate import Accelerator
from transformers import get_scheduler
from tqdm.auto import tqdm
import torch
import math
warnings.filterwarnings('ignore')

In [25]:
dataset = load_dataset("empathetic_dialogues", split='train')



## 2. Data Preprocessing

In [29]:
# Drop every column except 'utterance', then rename it to 'text'
dataset = dataset.remove_columns([col for col in dataset.column_names if col != 'utterance'])
dataset = dataset.rename_column('utterance', 'text')

## 3. Models

### Fine-tuning Masked Language Model

### Distil-BERT

#### Loading Pre-trained model

In [41]:
model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

#### Loading Pre-trained Tokenizer

In [42]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [43]:
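def tokenize_function(examples):
    # Note: padding every utterance to 512 tokens means the 128-token chunks
    # built in the chunking step are mostly [PAD]; each utterance yields
    # exactly four chunks (76673 * 4 = 306692 rows).
    result = tokenizer(examples["text"], return_tensors='pt', max_length=512, 
                       padding='max_length', truncation=True)
    # Fast tokenizers can map each token back to the word it came from
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result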

In [44]:
tokenized_datasets = dataset.map(
    tokenize_function, batched=True, remove_columns=["text"]
)
tokenized_datasets

Map:   0%|          | 0/76673 [00:00<?, ? examples/s]

Dataset({
    features: ['input_ids', 'attention_mask', 'word_ids'],
    num_rows: 76673
})

#### Chunking data

In [45]:
def group_texts(examples, chunk_size=128):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

In [46]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

Map:   0%|          | 0/76673 [00:00<?, ? examples/s]

Dataset({
    features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
    num_rows: 306692
})
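
To make the chunking step concrete, the sketch below applies the same concatenate-then-split logic to a toy batch with `chunk_size=4`. The `chunk_batch` name and the token ids are illustrative only, and it flattens the lists with `itertools.chain` instead of `sum(..., [])`, which is a stylistic swap and not the notebook's code.

```python
from itertools import chain

def chunk_batch(examples, chunk_size):
    """Concatenate each column, then split it into fixed-size chunks."""
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the trailing remainder that does not fill a whole chunk.
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # Labels start as a copy of the inputs; masking happens later.
    result["labels"] = [chunk[:] for chunk in result["input_ids"]]
    return result

toy_batch = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 102]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1]],
}
chunks = chunk_batch(toy_batch, chunk_size=4)
print(chunks["input_ids"])  # [[101, 7, 8, 102], [101, 9, 102, 101]]
```

The eleven toy tokens produce two full chunks, and the last three tokens are discarded. Chunks can also straddle the boundary between two original examples, as the second chunk does here.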

#### Downsampling data and dividing into train and test sets

In [47]:
train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets.train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
})

#### Training Arguments

In [48]:
batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-empathetic-dialogues",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    # push_to_hub=True,
    fp16=True,
    logging_steps=logging_steps,
)

#### Function for inserting random mask

In [49]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

def insert_random_mask(batch):
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    masked_inputs = data_collator(features)
    # Create a new "masked" column for each column in the dataset
    return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()}
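
The collator's core idea can be sketched in plain Python: pick roughly 15% of positions, replace them with the mask token id, and set the label to -100 everywhere else so the loss ignores unmasked positions. This is a simplified illustration, not the library's implementation: `DataCollatorForLanguageModeling` additionally leaves some selected tokens unchanged or swaps them for random tokens, and it skips special tokens. The `mask_tokens` helper is hypothetical, and the mask id 103 is assumed to be the `[MASK]` id of the uncased BERT vocabulary.

```python
import random

IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def mask_tokens(input_ids, mask_token_id=103, mlm_probability=0.15, seed=0):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    rng = random.Random(seed)
    masked_inputs, labels = [], []
    for token_id in input_ids:
        if rng.random() < mlm_probability:
            masked_inputs.append(mask_token_id)  # hide the token
            labels.append(token_id)              # the model must recover it
        else:
            masked_inputs.append(token_id)
            labels.append(IGNORE_INDEX)          # no loss at this position
    return masked_inputs, labels

token_ids = list(range(1000, 1040))  # made-up token ids
masked, labels = mask_tokens(token_ids)
num_masked = sum(1 for m in masked if m == 103)
print(num_masked, "of", len(token_ids), "tokens masked")
```

Because the masking is random, applying it once to the evaluation set (as `insert_random_mask` does) keeps the evaluation inputs identical from epoch to epoch.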

#### Training Model

In [50]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [51]:
downsampled_dataset = downsampled_dataset.remove_columns(["word_ids"])
eval_dataset = downsampled_dataset["test"].map(
    insert_random_mask,
    batched=True,
    remove_columns=downsampled_dataset["test"].column_names,
)
eval_dataset = eval_dataset.rename_columns(
    {
        "masked_input_ids": "input_ids",
        "masked_attention_mask": "attention_mask",
        "masked_labels": "labels",
    }
)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [52]:
batch_size = 64
train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
eval_dataloader = DataLoader(
    eval_dataset, batch_size=batch_size, collate_fn=default_data_collator
)

In [53]:
optimizer = AdamW(model.parameters(), lr=5e-5)

In [55]:
accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [56]:
num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
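
The step count and the linear schedule can be checked by hand. With 10,000 training chunks and a batch size of 64, one epoch has ceil(10000 / 64) = 157 steps, so three epochs give 471 steps, which matches the progress bar total in the training output. The sketch below reproduces the shape of a linear schedule with warmup; `linear_lr` is a hypothetical helper written for illustration, and it assumes a single device so that the dataloader is not split across processes.

```python
import math

def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` updates: linear warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)

steps_per_epoch = math.ceil(10_000 / 64)   # 157
num_training_steps = 3 * steps_per_epoch   # 471

# Learning rate at the start, after one epoch, and at the end of training.
for step in (0, steps_per_epoch, num_training_steps):
    print(step, linear_lr(step, 5e-5, num_training_steps))
```

With no warmup, the rate starts at the optimizer's 5e-5 and falls linearly to zero on the final step.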

In [57]:
output_dir = '/content/drive/MyDrive/Colab Notebooks/distilbert-base-uncased-finetuned-empathetic-dialogues'

In [58]:
progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(eval_dataset)]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save the model and tokenizer
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)

  0%|          | 0/471 [00:00<?, ?it/s]

>>> Epoch 0: Perplexity: 9.01748601261765
>>> Epoch 1: Perplexity: 8.42895547962771
>>> Epoch 2: Perplexity: 8.255840638675444

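
Perplexity here is the exponential of the mean cross-entropy loss over the evaluation set, so lower is better. A minimal sketch with made-up loss values follows; `perplexity_from_losses` is a hypothetical helper that mirrors the overflow guard in the training loop.

```python
import math

def perplexity_from_losses(losses):
    """exp(mean loss); returns infinity if the exponential overflows."""
    mean_loss = sum(losses) / len(losses)
    try:
        return math.exp(mean_loss)
    except OverflowError:
        return float("inf")

example_losses = [2.0, 2.2, 2.1]  # made-up per-batch losses
print(perplexity_from_losses(example_losses))

# Inverting the formula recovers the mean loss behind a reported perplexity,
# here the value printed for the final epoch.
final_mean_loss = math.log(8.255840638675444)
print(final_mean_loss)
```

Read this way, the drop in perplexity across the three epochs corresponds to a steadily decreasing mean evaluation loss.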

In [61]:
from transformers import pipeline

mask_filler = pipeline(
    "fill-mask", model="/content/drive/MyDrive/Colab Notebooks/distilbert-base-uncased-finetuned-empathetic-dialogues"
)

In [71]:
def predict(sample=''):
    """Print and return the fill-mask predictions for `sample` as a DataFrame."""
    preds = mask_filler(sample)
    df = pd.DataFrame(preds)
    # Round scores to two decimals for display
    df['score'] = df['score'].apply(lambda x: round(x, 2))
    print(f'Sample Text: {sample}')
    print('-'*50)
    print('Model Predictions')
    print('-'*50)
    print(df)
    print('-'*50)
    return df

In [73]:
df = predict(sample='I am [MASK] to the shop.')

Sample Text: I am [MASK] to the shop.
--------------------------------------------------
Model Predictions
--------------------------------------------------
   score  token token_str                   sequence
0   0.74   2183     going    i am going to the shop.
1   0.06   3753    headed   i am headed to the shop.
2   0.03   2485     close    i am close to the shop.
3   0.03   2746    coming   i am coming to the shop.
4   0.01   6160   welcome  i am welcome to the shop.
--------------------------------------------------


In [72]:
df2 = predict(sample='I [MASK] cake!')

Sample Text: I [MASK] cake!
--------------------------------------------------
Model Predictions
--------------------------------------------------
   score  token token_str      sequence
0   0.57   2293      love  i love cake!
1   0.15   2066      like  i like cake!
2   0.07   2215      want  i want cake!
3   0.03   2031      have  i have cake!
4   0.03   5223      hate  i hate cake!
--------------------------------------------------
