<a href="https://colab.research.google.com/github/sljm12/machine_learning_notebooks/blob/master/nlp/ner/Few_shot_NER_learning_with_T5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 **Few shot text generation with T5 Transformer**

This notebook was modified from https://towardsdatascience.com/poor-mans-gpt-3-few-shot-text-generation-with-t5-transformer-51f1b01f843e

The idea is to test whether few-shot learning can also handle tasks like NER with a text-to-text transformer such as T5: we supply a sentence as input and the model outputs the named entities as text.

We test it on aircraft names.
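
To make the text-to-text framing concrete, here is a minimal sketch of how one sentence and its aircraft names can be turned into an input/target string pair. The helper name `to_text_pair` is hypothetical; the `falsify: ` prefix and the trailing ` </s>` mirror what the training loop later in this notebook uses.

```python
def to_text_pair(sentence, entities, prefix="falsify: ", eos=" </s>"):
    """Return (input_text, target_text) for one NER example.

    The entities are joined into a single comma-separated string so that
    the model can emit them as ordinary text.
    """
    input_text = prefix + sentence + eos
    target_text = ", ".join(entities) + eos
    return input_text, target_text


inp, tgt = to_text_pair(
    "Malaysia keen on buying Kuwait's Hornet fighter jets",
    ["Hornet"],
)
print(inp)
print(tgt)
```

Sentences with several aircraft simply get a longer comma-separated target, which is the format used for the training pairs in section 2.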

## 1. Install libraries

In [None]:
!pip install transformers==2.9.0 wandb


In [None]:
# Check we have a GPU and check the memory size of the GPU
!nvidia-smi

Wed May 18 04:58:59 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## 2. Prepare Model

In [None]:

import random
import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

from transformers import (
    AdamW,
    T5ForConditionalGeneration,
    T5Tokenizer,
    get_linear_schedule_with_warmup
)

def set_seed(seed):
  random.seed(seed)
  np.random.seed(seed)
  torch.manual_seed(seed)

set_seed(42)

In [None]:
tokenizer = T5Tokenizer.from_pretrained('t5-base')
t5_model = T5ForConditionalGeneration.from_pretrained('t5-base')



In [None]:
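# Optimizer: parameters are split into two groups by name so that weight decay
# could be applied selectively. Both groups currently use weight_decay=0.0,
# so the split does not change the optimizer's behaviour.
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        # parameters whose names match none of the no_decay patterns
        "params": [p for n, p in t5_model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
    {
        # parameters whose names match a no_decay pattern
        "params": [p for n, p in t5_model.named_parameters() if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = AdamW(optimizer_grouped_parameters, lr=3e-4, eps=1e-8)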
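
The two parameter groups above are selected by substring matching on parameter names. The sketch below reproduces only that selection logic on a few made-up names (they are illustrative and not read from the T5 checkpoint), so the effect of the `no_decay` list is easy to inspect. `split_by_no_decay` is a hypothetical helper.

```python
def split_by_no_decay(param_names, no_decay):
    """Split names into (decay, skip_decay) using substring matching."""
    decay, skip_decay = [], []
    for name in param_names:
        if any(pattern in name for pattern in no_decay):
            skip_decay.append(name)
        else:
            decay.append(name)
    return decay, skip_decay


# Made-up parameter names, for illustration only
names = [
    "block.0.attention.weight",
    "block.0.attention.bias",
    "block.0.LayerNorm.weight",
    "block.0.layer_norm.weight",
]
decay, skip_decay = split_by_no_decay(names, ["bias", "LayerNorm.weight"])
print("decay:     ", decay)
print("skip decay:", skip_decay)
```

The match is case-sensitive, so a differently spelled name such as `layer_norm.weight` falls into the first group. Running the same check on the names from `t5_model.named_parameters()` would show where each real parameter lands.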



In [None]:
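# Few-shot training data: (sentence, comma-separated aircraft names) pairs.
# The variable name is inherited from the original article's notebook.
true_false_adjective_tuples = [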
                     ("The RMAF then put its Hawk 208 jets from No 6 squadron on high alert, said the statement.","Hawk 208"),
                     ("Malaysia keen on buying Kuwait’s Hornet fighter jets", "Hornet"),
                     ("Malaysia is hoping to buy Kuwait’s entire fleet of Boeing F/A-18 Hornet multi-role fighter jets, although discussions between both governments over the sale have yet to begin.", "Boeing F/A-18 Hornet"),
                     ("Malaysia currently operates a fleet of eight F/A-18D twin-seat fighters in the air defense and strike role, serving alongside 18 Russian-built Sukhoi Su-30MKM Flanker-H jets.","F/A-18D, Sukhoi Su-30MKM Flanker-H"),
                     ("Kuwait is seeking to dispose of its fleet of F/A-18C single-seaters and F/A-18Ds, 40 of which were acquired in the aftermath of the 1991 Gulf War.", "F/A-18C, F/A-18D"),
                     ("The small Persian Gulf emirate is currently taking delivery of 28 Eurofighter Typhoons and a similar number of F/A-18E/F Super Hornet fighters", "Eurofighter Typhoons, F/A-18E/F Super Hornet"),
                     ("The country has instead put its emphasis on acquiring a new light combat aircraft to replace the RMAF’s fleet of Hawk 108 jet trainers and Hawk 208 light combat aircraft, which also date back to the late 1990s and have suffered from a series of crashes and accidents.", "Hawk 108, Hawk 208"),
                     ("Malaysia has evaluated the Super Hornet and Typhoon alongside the French Dassault Rafale as it flirted with the procurement of a new multi-role combat aircraft.", "Super Hornet, Typhoon, Dassault Rafale")
]
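
With so few training pairs, a typo in a single label can matter, so it may be worth checking that every target name literally occurs in its sentence. The following is a minimal sketch of such a check; `find_missing_entities` is a hypothetical helper, and the third pair is a deliberately broken example that is not part of the training data.

```python
def find_missing_entities(pairs):
    """Return (sentence, entity) for every entity not found in its sentence."""
    problems = []
    for sentence, target in pairs:
        for entity in (part.strip() for part in target.split(",")):
            if entity and entity not in sentence:
                problems.append((sentence, entity))
    return problems


pairs = [
    ("Malaysia keen on buying Kuwait's Hornet fighter jets", "Hornet"),
    ("Kuwait is seeking to dispose of its fleet of F/A-18C single-seaters "
     "and F/A-18Ds, 40 of which were acquired in the aftermath of the 1991 "
     "Gulf War.", "F/A-18C, F/A-18D"),
    # Deliberately wrong label to show what the check reports
    ("The RMAF operates Hawk 208 jets.", "Hawk 2008"),
]

for sentence, entity in find_missing_entities(pairs):
    print("not found:", repr(entity), "in", repr(sentence))
```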

In [None]:
# Token count of each training sentence, used to choose a safe max_length for training
list_tokens = [len(tokenizer.encode(i[0])) for i in true_false_adjective_tuples]
max_length = max(list_tokens)
print(max_length)

61
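
The longest training sentence is 61 tokens, and the training loop in the next section pads every sequence to a fixed `max_length` of 96, which leaves room for the task prefix and the end-of-sequence marker. The sketch below shows roughly what padding to a fixed length produces, using plain Python lists. The token ids are made up, `pad_to_length` is a hypothetical helper, and a pad id of 0 is assumed.

```python
def pad_to_length(token_ids, max_length, pad_id=0):
    """Truncate or right-pad `token_ids` to exactly `max_length` items.

    Returns (padded_ids, attention_mask), where the mask is 1 for real
    tokens and 0 for padding.
    """
    ids = list(token_ids)[:max_length]
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


ids, mask = pad_to_length([100, 200, 300, 1], max_length=6)
print(ids)   # [100, 200, 300, 1, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```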


## 3. Train Loop

In [None]:
t5_model.train()

epochs = 20

for epoch in range(epochs):
  print("epoch ", epoch)
  for sentence, target in true_false_adjective_tuples:
    # "falsify: " is the task prefix inherited from the original article;
    # the same prefix must be used at test time. The end-of-sequence
    # token is appended explicitly.
    input_sent = "falsify: " + sentence + " </s>"
    output_sent = target + " </s>"

    # Pad both sequences to a fixed length of 96 tokens
    tokenized_inp = tokenizer.encode_plus(input_sent, max_length=96, pad_to_max_length=True, return_tensors="pt")
    tokenized_output = tokenizer.encode_plus(output_sent, max_length=96, pad_to_max_length=True, return_tensors="pt")

    input_ids = tokenized_inp["input_ids"]
    attention_mask = tokenized_inp["attention_mask"]

    lm_labels = tokenized_output["input_ids"]
    decoder_attention_mask = tokenized_output["attention_mask"]

    # the forward function automatically creates the correct decoder_input_ids
    outputs = t5_model(input_ids=input_ids, lm_labels=lm_labels, decoder_attention_mask=decoder_attention_mask, attention_mask=attention_mask)
    loss = outputs[0]

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()




epoch  0




epoch  1
epoch  2
epoch  3
epoch  4
epoch  5
epoch  6
epoch  7
epoch  8
epoch  9
epoch  10
epoch  11
epoch  12
epoch  13
epoch  14
epoch  15
epoch  16
epoch  17
epoch  18
epoch  19
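
In the loop above, the target is padded to 96 tokens and the padded ids are passed as labels unchanged, so padding positions appear to contribute to the loss as well. A common alternative is to replace padding ids in the labels with -100, the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below shows only that replacement step on plain Python lists. `mask_pad_labels` is a hypothetical helper, the token ids are made up, a pad id of 0 is assumed, and whether the pinned `transformers==2.9.0` accepts -100 labels should be verified before using this in the loop.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_pad_labels(label_ids, pad_id=0, ignore_index=IGNORE_INDEX):
    """Replace every padding id in the labels with `ignore_index`."""
    return [ignore_index if token == pad_id else token for token in label_ids]


labels = [8142, 3, 1, 0, 0, 0]  # made-up ids followed by padding
masked = mask_pad_labels(labels)
print(masked)  # [8142, 3, 1, -100, -100, -100]
```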


## 4. Test model

In [None]:
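def test_text(s):
  """Print the three best beam-search outputs for the sentence `s`."""
  # Use the same task prefix and end-of-sequence marker as in training
  test_sent = 'falsify: '+ s +' </s>'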
  test_tokenized = tokenizer.encode_plus(test_sent, return_tensors="pt")

  test_input_ids  = test_tokenized["input_ids"]
  test_attention_mask = test_tokenized["attention_mask"]

  t5_model.eval()
  beam_outputs = t5_model.generate(
      input_ids=test_input_ids,attention_mask=test_attention_mask,
      max_length=96,
      early_stopping=True,
      num_beams=10,
      num_return_sequences=3,
      no_repeat_ngram_size=2
  )

  for beam_output in beam_outputs:
      sent = tokenizer.decode(beam_output, skip_special_tokens=True,clean_up_tokenization_spaces=True)
      print (sent)

In [None]:
# The aircraft in this sentence do not appear in the training set
test_text("three of India’s Hindustan Aeronautics Limited-made Tejas showed off their capabilities.")



Tejas
Hindustan Aeronautics Limited-made Teja
Tejas-made in india showed off their capabilities


In [None]:
# The aircraft in this sentence do not appear in the training set
test_text("USAF Sends F-35s, B-52s, F-15s to Europe as NATO Ministers Opt for More Deterrence")



F-35, B-52, F-15
F-35, B-52
F-35, B-52s, F-15


In [None]:
# The aircraft in this sentence do not appear in the training set
test_text("The RSAF contingent will include nine F-16C/D fighter aircraft and more than 100 personnel from Peace Carvin II detachment in Luke Air Force Base, Arizona.")



F-16C/D
F-16C/D, Peace Carvin II
F-16C/D, peace carvin II


In [None]:
test_text("RSAF to fly F-16 fighter jets for \'at least\' another decade, following F-35 developments \'closely\': Air force chief.")



F-16
F-16, F-35
RSAF to fly F-16, F-35


In [None]:
t5_model.save_pretrained("/content/sample_data")

In [None]:
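# Note: this writes a gzip-compressed tar archive despite the .zip name, and it also
# picks up Colab's bundled sample files because the model was saved into /content/sample_data.
!tar -czvf model.zip /content/sample_data/*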

tar: Removing leading `/' from member names
/content/sample_data/anscombe.json
/content/sample_data/california_housing_test.csv
/content/sample_data/california_housing_train.csv
/content/sample_data/config.json
/content/sample_data/mnist_test.csv
/content/sample_data/mnist_train_small.csv
/content/sample_data/pytorch_model.bin
/content/sample_data/README.md


In [None]:
!ls -lh /content

total 771M
-rw-r--r-- 1 root root 771M May 18 05:50 model.zip
drwxr-xr-x 1 root root 4.0K May 18 05:48 sample_data
drwxr-xr-x 2 root root 4.0K May 18 05:48 save_model
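
The archive above also contains Colab's sample CSV files, because the model was saved next to them. A minimal sketch of an alternative is to archive only the two files that `save_pretrained` wrote in the listing above (`config.json` and `pytorch_model.bin`), using Python's `tarfile` module. `archive_model` is a hypothetical helper, and the demo runs on a temporary directory with placeholder files rather than the real model.

```python
import os
import tarfile
import tempfile

MODEL_FILES = ("config.json", "pytorch_model.bin")


def archive_model(model_dir, archive_path, filenames=MODEL_FILES):
    """Write the listed files from `model_dir` into a gzip-compressed tar."""
    added = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in filenames:
            path = os.path.join(model_dir, name)
            if os.path.isfile(path):
                # arcname stores the file without its directory prefix
                tar.add(path, arcname=name)
                added.append(name)
    return added


# Demo on a throwaway directory with stand-in files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("config.json", "pytorch_model.bin", "mnist_test.csv"):
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write("placeholder")
    archive = os.path.join(tmp, "model.tar.gz")
    added = archive_model(tmp, archive)
    with tarfile.open(archive) as tar:
        members = tar.getnames()

print(members)  # the CSV file is left out
```

Naming the file `model.tar.gz` also makes the format clear to whoever unpacks it later.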
