Author: **Ramsri Goutham Golla**  [Linkedin](https://www.linkedin.com/in/ramsrig/)   [Twitter](https://twitter.com/ramsri_goutham/)


I recently launched **Practical Introduction to NLP**, an online course.

If you are interested, please check out the **[syllabus and enroll.](https://www.learnnlp.academy/practical-introduction-to-natural-language-processing)**


## 1. T5 question generation model

In [None]:
# !pip install --quiet transformers==4.5.0
# We install this specific commit of transformers because it adds support for exporting T5 to ONNX, which the fastT5 library relies on.
# https://github.com/huggingface/transformers/commit/5c00918681d6b4027701eb46cea8f795da0d4064
!pip install --quiet git+https://github.com/huggingface/transformers.git@5c00918681d6b4027701eb46cea8f795da0d4064
!pip install --quiet sentencepiece==0.1.95


In [None]:
!pip install --quiet ipython-autotime
%load_ext autotime

time: 166 µs (started: 2021-08-21 10:50:00 +00:00)


In [None]:
from transformers import T5ForConditionalGeneration,T5Tokenizer

#T5 model size on disk ~ 900 MB
question_model = T5ForConditionalGeneration.from_pretrained('ramsrigouthamg/t5_squad_v1')
question_tokenizer = T5Tokenizer.from_pretrained('ramsrigouthamg/t5_squad_v1')


time: 37.6 s (started: 2021-08-21 10:50:06 +00:00)


In [None]:
def get_question(sentence, answer, mdl, tknizer):
  # Build the input in the "context: ... answer: ..." format used with this model
  text = "context: {} answer: {}".format(sentence, answer)
  print(text)
  max_len = 256
  # padding=False replaces the deprecated pad_to_max_length=False
  encoding = tknizer.encode_plus(text, max_length=max_len, padding=False, truncation=True, return_tensors="pt")

  input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]

  # Beam search with 5 beams, returning only the best sequence
  outs = mdl.generate(input_ids=input_ids,
                      attention_mask=attention_mask,
                      early_stopping=True,
                      num_beams=5,
                      num_return_sequences=1,
                      no_repeat_ngram_size=2,
                      max_length=300)

  dec = [tknizer.decode(ids, skip_special_tokens=True) for ids in outs]

  # Drop the "question:" marker from the decoded text
  question = dec[0].replace("question:", "")
  question = question.strip()
  return question


# context = "Ramsri loves to watch cricket during his free time"
# answer = "cricket"

context = "Donald Trump is an American media personality and businessman who served as the 45th president of the United States."
answer = "Donald Trump"

ques = get_question(context,answer,question_model,question_tokenizer)
print ("question: ",ques)



context: Donald Trump is an American media personality and businessman who served as the 45th president of the United States. answer: Donald Trump




question:  Who is the 45th president of the United States?
time: 2.17 s (started: 2021-08-21 10:50:54 +00:00)
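
Note that `get_question` uses `str.replace`, which removes every occurrence of `question:` in the decoded text, not only the leading marker. If that ever matters for your inputs, a variant that strips the marker only at the start could look like the sketch below. `strip_question_prefix` is a hypothetical helper name, not part of transformers or of the original notebook.

```python
def strip_question_prefix(decoded, prefix="question:"):
    """Remove `prefix` only when it starts the decoded string, then trim whitespace."""
    decoded = decoded.strip()
    if decoded.startswith(prefix):
        decoded = decoded[len(prefix):]
    return decoded.strip()


# Hypothetical decoded model output, used only for illustration
print(strip_question_prefix("question: Who is the 45th president of the United States?"))
```

For the example above the two approaches give the same result; they differ only when `question:` also appears later in the generated text.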


## 2. First taste of production deployment: creating a UI with a Gradio app
https://www.gradio.app/

In [None]:
!pip install --quiet gradio==3.9

time: 9.51 s (started: 2021-08-21 10:51:10 +00:00)


In [None]:
import gradio as gr

context = gr.inputs.Textbox(lines=5, placeholder="Enter paragraph/context here...")
answer = gr.inputs.Textbox(lines=3, placeholder="Enter answer/keyword here...")
question = gr.outputs.Textbox( type="auto", label="Question")

def generate_question(context,answer):
  return get_question(context,answer,question_model,question_tokenizer)

iface = gr.Interface(
  fn=generate_question, 
  inputs=[context,answer], 
  outputs=question)
iface.launch(debug=False)

Colab notebook detected. To show errors in colab notebook, set `debug=True` in `launch()`
This share link will expire in 24 hours. If you need a permanent link, visit: https://gradio.app/introducing-hosted (NEW!)
Running on External URL: https://13605.gradio.app
Interface loading below...


(<Flask 'gradio.networking'>,
 'http://127.0.0.1:7860/',
 'https://13605.gradio.app')

time: 4.64 s (started: 2021-08-21 10:51:28 +00:00)


## 3. Convert the T5 PyTorch model to ONNX format and quantize it using the FastT5 library

https://github.com/Ki6an/fastT5

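FastT5 aims to reduce T5 model size by 3X and increase inference speed by up to 5X. In this notebook's run, the quantized `models` folder ends up at 404 MB, compared with roughly 900 MB for the PyTorch model, and generating one question takes 878 ms instead of 2.17 s.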

In [None]:
rm -f -r models/

time: 160 ms (started: 2021-08-21 10:51:52 +00:00)


In [None]:
!pip install onnx==1.9.0
# Quote the version specifier; an unquoted ">=" is treated by the shell as an output redirect
!pip install onnxruntime==1.7.0 "progress>=1.5"
!pip install fastt5==0.0.4 --no-dependencies

time: 7.2 s (started: 2021-08-21 10:54:01 +00:00)


In [None]:
from fastT5 import export_and_get_onnx_model,generate_onnx_representation,quantize
from transformers import T5Config,AutoTokenizer

trained_model_path = 'ramsrigouthamg/t5_squad_v1'

# Step 1. Convert the Hugging Face T5 model to ONNX
onnx_model_paths = generate_onnx_representation(trained_model_path)

# Step 2. (recommended) quantize the converted model for fast inference and to reduce model size.
quant_model_paths = quantize(onnx_model_paths)

tokenizer_onnx = AutoTokenizer.from_pretrained(trained_model_path)
config = T5Config.from_pretrained(trained_model_path)



Exporting to onnx... |################################| 3/3
Quantizing... |################################| 3/3

time: 1min 52s (started: 2021-08-21 10:54:15 +00:00)


In [None]:
# Also save the tokenizer and config into the models folder
tokenizer_onnx.save_pretrained('models/')
config.save_pretrained('models/')

time: 65 ms (started: 2021-08-21 10:56:16 +00:00)


**Remove the non-quantized ONNX files; we do not need them**

In [None]:
rm -f -r models/*decoder.onnx

time: 332 ms (started: 2021-08-21 10:56:21 +00:00)


In [None]:
rm -f -r models/*encoder.onnx

time: 225 ms (started: 2021-08-21 10:56:24 +00:00)
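
The two `rm` commands above rely on shell globs that match the non-quantized encoder, decoder, and init-decoder files. The same cleanup can be sketched in Python, which also makes the rule explicit: keep only files ending in `-quantized.onnx`. `remove_unquantized` is a hypothetical helper name, and the suffix check assumes the file naming shown later in this notebook.

```python
from pathlib import Path


def remove_unquantized(model_dir):
    """Delete .onnx files in `model_dir` that are not quantized; return the removed names."""
    removed = []
    for path in sorted(Path(model_dir).glob("*.onnx")):
        # Quantized files are assumed to end with "-quantized.onnx"
        if not path.name.endswith("-quantized.onnx"):
            path.unlink()
            removed.append(path.name)
    return removed
```

Calling `remove_unquantized("models")` would then replace both `rm` cells and report which files were deleted.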


In [None]:
!du -sh models

404M	models
time: 219 ms (started: 2021-08-21 10:56:26 +00:00)
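
If `du` is not available (for example when running outside Colab), the folder size can also be computed from Python. This is a minimal sketch; `dir_size_mb` is a hypothetical helper name and the result is in MB of 1024 * 1024 bytes, so it may differ slightly from what `du -sh` rounds to.

```python
from pathlib import Path


def dir_size_mb(folder):
    """Return the total size of all regular files under `folder`, in MB."""
    total_bytes = sum(p.stat().st_size for p in Path(folder).rglob("*") if p.is_file())
    return total_bytes / (1024 * 1024)


# Only report when the folder from the export step is present
if Path("models").is_dir():
    print(f"models: {dir_size_mb('models'):.0f} MB")
```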


In [None]:
# connect your personal google drive to store dataset and trained model
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive
time: 22.3 s (started: 2021-08-21 10:56:29 +00:00)


In [None]:
!cp -r models '/content/gdrive/My Drive/t5_paraphraser/t5_squad_v1'

time: 2.78 s (started: 2021-08-21 10:56:56 +00:00)


## 4. ONNX inference

In [None]:
import torch
print (torch.__version__)

1.9.0+cu102
time: 2.2 ms (started: 2021-08-21 10:57:04 +00:00)


In [None]:
# Quote the version specifier; an unquoted ">=" is treated by the shell as an output redirect
!pip install transformers==4.6.1 onnx onnxruntime==1.7.0 "progress>=1.5" sentencepiece

In [None]:
!pip install fastt5==0.0.4 --no-dependencies

time: 928 ms (started: 2021-08-21 10:57:52 +00:00)


In [None]:
!pip install --quiet ipython-autotime
%load_ext autotime

The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
time: 3.25 s (started: 2021-08-21 10:57:56 +00:00)


In [None]:
# connect your personal google drive to store dataset and trained model
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
time: 6.99 ms (started: 2021-08-21 10:58:02 +00:00)


In [None]:
from fastT5 import get_onnx_model,get_onnx_runtime_sessions,OnnxT5
from transformers import AutoTokenizer
from pathlib import Path
import os

trained_model_path = '/content/gdrive/My Drive/t5_paraphraser/t5_squad_v1'

pretrained_model_name = Path(trained_model_path).stem


encoder_path = os.path.join(trained_model_path,f"{pretrained_model_name}-encoder-quantized.onnx")
decoder_path = os.path.join(trained_model_path,f"{pretrained_model_name}-decoder-quantized.onnx")
init_decoder_path = os.path.join(trained_model_path,f"{pretrained_model_name}-init-decoder-quantized.onnx")

model_paths = encoder_path, decoder_path, init_decoder_path
model_sessions = get_onnx_runtime_sessions(model_paths)
model = OnnxT5(trained_model_path, model_sessions)

tokenizer = AutoTokenizer.from_pretrained(trained_model_path)

time: 17.7 s (started: 2021-08-21 10:58:07 +00:00)
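
If the copy to Google Drive was incomplete, loading the sessions fails on a missing file. A small check before calling `get_onnx_runtime_sessions` can make that easier to diagnose. The sketch below assumes the same file naming as the cell above (folder name followed by `-encoder-quantized.onnx` and so on); `missing_onnx_files` is a hypothetical helper name, not part of fastT5.

```python
from pathlib import Path


def missing_onnx_files(model_dir):
    """Return the names of expected quantized ONNX files that are absent from `model_dir`."""
    model_dir = Path(model_dir)
    name = model_dir.stem  # same naming convention as the loading cell
    expected = [
        model_dir / f"{name}-encoder-quantized.onnx",
        model_dir / f"{name}-decoder-quantized.onnx",
        model_dir / f"{name}-init-decoder-quantized.onnx",
    ]
    return [p.name for p in expected if not p.is_file()]
```

An empty list from `missing_onnx_files(trained_model_path)` means all three files are in place.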


In [None]:
def get_question(sentence, answer, mdl, tknizer):
  # Build the input in the "context: ... answer: ..." format used with this model
  text = "context: {} answer: {}".format(sentence, answer)
  print(text)
  max_len = 256
  # padding=False replaces the deprecated pad_to_max_length=False
  encoding = tknizer.encode_plus(text, max_length=max_len, padding=False, truncation=True, return_tensors="pt")

  input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]

  # Beam search with 5 beams, returning only the best sequence
  outs = mdl.generate(input_ids=input_ids,
                      attention_mask=attention_mask,
                      early_stopping=True,
                      num_beams=5,
                      num_return_sequences=1,
                      no_repeat_ngram_size=2,
                      max_length=300)

  dec = [tknizer.decode(ids, skip_special_tokens=True) for ids in outs]

  # Drop the "question:" marker from the decoded text
  question = dec[0].replace("question:", "")
  question = question.strip()
  return question


# context = "Ramsri loves to watch cricket during his free time"
# answer = "cricket"

context = "Donald Trump is an American media personality and businessman who served as the 45th president of the United States."
answer = "Donald Trump"

ques = get_question(context,answer,model,tokenizer)
print ("question: ",ques)


context: Donald Trump is an American media personality and businessman who served as the 45th president of the United States. answer: Donald Trump
question:  Who is the 45th president of the United States?
time: 878 ms (started: 2021-08-21 10:58:28 +00:00)
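
The timings reported by autotime (2.17 s for the PyTorch model earlier, 878 ms here) come from a single call each, so they include one-off overhead and some noise. For a steadier comparison, one option is to average several runs after a warm-up call. The following is a minimal sketch using only the standard library; `benchmark` is a hypothetical helper name.

```python
import statistics
import time


def benchmark(fn, n_runs=5, warmup=1):
    """Call `fn` repeatedly and return (mean, stdev) of wall-clock latency in seconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    # stdev needs at least two samples
    stdev = statistics.stdev(timings) if len(timings) > 1 else 0.0
    return statistics.mean(timings), stdev
```

With the model loaded, a call such as `benchmark(lambda: get_question(context, answer, model, tokenizer))` would return the mean and standard deviation for the ONNX model; running the same call with the PyTorch model and tokenizer gives the baseline.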


In [None]:
import gradio as gr

context = gr.inputs.Textbox(lines=5, placeholder="Enter paragraph/context here...")
answer = gr.inputs.Textbox(lines=3, placeholder="Enter answer/keyword here...")
question = gr.outputs.Textbox( type="auto", label="Question")

def generate_question(context,answer):
  return get_question(context,answer,model,tokenizer)

iface = gr.Interface(
  fn=generate_question, 
  inputs=[context,answer], 
  outputs=question)
iface.launch(debug=False)

Colab notebook detected. To show errors in colab notebook, set `debug=True` in `launch()`
This share link will expire in 24 hours. If you need a permanent link, visit: https://gradio.app/introducing-hosted (NEW!)
Running on External URL: https://10212.gradio.app
Interface loading below...


(<Flask 'gradio.networking'>,
 'http://127.0.0.1:7861/',
 'https://10212.gradio.app')

time: 1.74 s (started: 2021-08-21 10:58:37 +00:00)
